Paperid: 1, https://arxiv.org/pdf/2509.26645.pdf   GitHub
Authors:Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Title: TTT3R: 3D Reconstruction as Test-Time Training
Abstract:
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available in https://rover-xingyu.github.io/TTT3R
中文: TTT3R提出了一种无需训练的方法,通过对齐置信度动态调整记忆更新,显著提升了3D重建中的长度泛化能力,在保持高效的同时将姿态估计精度提高了一倍。
English: TTT3R introduces a training-free method that uses alignment confidence to dynamically adjust memory updates, significantly enhancing length generalization in 3D reconstruction and doubling pose estimation accuracy while maintaining efficiency.

Authors:Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata
Title: Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Abstract:
Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.
中文:Stitch是一种无需训练的方法,通过自动生成的边界框在现代文本到图像模型中创建并无缝整合对象,从而提升空间关系的准确性,并在基于位置的生成任务中实现了最先进的性能。
English: Stitch is a training-free method that enhances spatial accuracy in modern text-to-image models by using automatically generated bounding boxes to create and seamlessly integrate objects, achieving state-of-the-art performance on position-based tasks.

Authors:Shangding Gu, Xiaohan Wang, Donghao Ying, Haoyu Zhao, Runing Yang, Ming Jin, Boyi Li, Marco Pavone, Serena Yeung-Levy, Jun Wang, Dawn Song, Costas Spanos
Title: AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond
Abstract:
Rapid advances in multimodal models demand benchmarks that rigorously evaluate understanding and reasoning in safety-critical, dynamic real-world settings. We present AccidentBench, a large-scale benchmark that combines vehicle accident scenarios with Beyond domains, safety-critical settings in air and water that emphasize spatial and temporal reasoning (e.g., navigation, orientation, multi-vehicle motion). The benchmark contains approximately 2000 videos and over 19000 human-annotated question--answer pairs spanning multiple video lengths (short/medium/long) and difficulty levels (easy/medium/hard). Tasks systematically probe core capabilities: temporal, spatial, and intent understanding and reasoning. By unifying accident-centric traffic scenes with broader safety-critical scenarios in air and water, AccidentBench offers a comprehensive, physically grounded testbed for evaluating models under real-world variability. Evaluations of state-of-the-art models (e.g., Gemini-2.5 Pro and GPT-5) show that even the strongest models achieve only about 18% accuracy on the hardest tasks and longest videos, revealing substantial gaps in real-world temporal, spatial, and intent reasoning. AccidentBench is designed to expose these critical gaps and drive the development of multimodal models that are safer, more robust, and better aligned with real-world safety-critical challenges. The code and dataset are available at: https://github.com/SafeRL-Lab/AccidentBench
中文: AccidentBench作为综合多模态基准,融合2000多个事故视频和1.9万组问答对,用于评估模型在安全关键场景中的时空推理能力,结果显示顶尖模型在最难任务中仅达18%准确率,暴露出重大能力缺陷。
English: AccidentBench is a comprehensive multimodal benchmark combining 2000+ accident videos and 19,000+ QA pairs to evaluate models' spatial-temporal reasoning in safety-critical scenarios, revealing major performance gaps as top models achieve only 18% accuracy on hardest tasks.

Authors:Siddarth Venkatraman, Vineet Jain, Sarthak Mittal, Vedant Shah, Johan Obando-Ceron, Yoshua Bengio, Brian R. Bartoldson, Bhavya Kailkhura, Guillaume Lajoie, Glen Berseth, Nikolay Malkin, Moksh Jain
Title: Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models
Abstract:
Test-time scaling methods improve the capabilities of large language models (LLMs) by increasing the amount of compute used during inference to make a prediction. Inference-time compute can be scaled in parallel by choosing among multiple independent solutions or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods that combines the benefits of both parallel and sequential scaling. Each step of RSA refines a population of candidate reasoning chains through aggregation of subsets to yield a population of improved solutions, which are then used as the candidate pool for the next iteration. RSA exploits the rich information embedded in the reasoning chains -- not just the final answers -- and enables bootstrapping from partially correct intermediate steps within different chains of thought. Empirically, RSA delivers substantial performance gains with increasing compute budgets across diverse tasks, model families and sizes. Notably, RSA enables Qwen3-4B-Instruct-2507 to achieve competitive performance with larger reasoning models, including DeepSeek-R1 and o3-mini (high), while outperforming purely parallel and sequential scaling strategies across AIME-25, HMMT-25, Reasoning Gym, LiveCodeBench-v6, and SuperGPQA. We further demonstrate that training the model to combine solutions via a novel aggregation-aware reinforcement learning approach yields significant performance gains. Code available at https://github.com/HyperPotatoNeo/RSA.
中文:递归自聚合(RSA)是一种测试时扩展方法,通过子集聚合迭代优化候选推理链,结合了并行与顺序扩展的优势,在多种任务中实现显著性能提升,使较小模型能够与大型推理模型竞争。
English: Recursive Self-Aggregation (RSA) is a test-time scaling method that combines parallel and sequential scaling by iteratively refining candidate reasoning chains through subset aggregation, achieving substantial performance gains across diverse tasks and enabling smaller models to compete with larger reasoning models.

Authors:Yixuan Weng, Minjun Zhu, Qiujie Xie, Qiyao Sun, Zhen Lin, Sifan Liu, Yue Zhang
Title: DeepScientist: Advancing Frontier-Pushing Scientific Findings Progressively
Abstract:
While previous AI Scientist systems can generate novel findings, they often lack the focus to produce scientifically valuable contributions that address pressing human-defined challenges. We introduce DeepScientist, a system designed to overcome this by conducting goal-oriented, fully autonomous scientific discovery over month-long timelines. It formalizes discovery as a Bayesian Optimization problem, operationalized through a hierarchical evaluation process consisting of "hypothesize, verify, and analyze". Leveraging a cumulative Findings Memory, this loop intelligently balances the exploration of novel hypotheses with exploitation, selectively promoting the most promising findings to higher-fidelity levels of validation. Consuming over 20,000 GPU hours, the system generated about 5,000 unique scientific ideas and experimentally validated approximately 1100 of them, ultimately surpassing human-designed state-of-the-art (SOTA) methods on three frontier AI tasks by 183.7\%, 1.9\%, and 7.9\%. This work provides the first large-scale evidence of an AI achieving discoveries that progressively surpass human SOTA on scientific tasks, producing valuable findings that genuinely push the frontier of scientific discovery. To facilitate further research into this process, we will open-source all experimental logs and system code at https://github.com/ResearAI/DeepScientist/.
中文: DeepScientist是一个目标导向的AI系统,通过贝叶斯优化和分层评估流程自主进行科学发现,生成数千个已验证的科学构想,并在三项AI任务上以显著优势超越人类设计的最先进方法。
English: DeepScientist is a goal-oriented AI system that autonomously conducts scientific discovery through Bayesian Optimization and a hierarchical evaluation process, generating thousands of validated ideas and surpassing human-designed methods on three AI tasks by significant margins.

Authors:Jian Guo Pan, Lin Wang, Xia Cai
Title: Automated and Scalable SEM Image Analysis of Perovskite Solar Cell Materials via a Deep Segmentation Framework
Abstract:
Scanning Electron Microscopy (SEM) is indispensable for characterizing the microstructure of thin films during perovskite solar cell fabrication. Accurate identification and quantification of lead iodide and perovskite phases are critical because residual lead iodide strongly influences crystallization pathways and defect formation, while the morphology of perovskite grains governs carrier transport and device stability. Yet current SEM image analysis is still largely manual, limiting throughput and consistency. Here, we present an automated deep learning-based framework for SEM image segmentation that enables precise and efficient identification of lead iodide, perovskite and defect domains across diverse morphologies. Built upon an improved YOLOv8x architecture, our model named PerovSegNet incorporates two novel modules: (i) Adaptive Shuffle Dilated Convolution Block, which enhances multi-scale and fine-grained feature extraction through group convolutions and channel mixing; and (ii) Separable Adaptive Downsampling module, which jointly preserves fine-scale textures and large-scale structures for more robust boundary recognition. Trained on an augmented dataset of 10,994 SEM images, PerovSegNet achieves a mean Average Precision of 87.25% with 265.4 Giga Floating Point Operations, outperforming the baseline YOLOv8x-seg by 4.08%, while reducing model size and computational load by 24.43% and 25.22%, respectively. Beyond segmentation, the framework provides quantitative grain-level metrics, such as lead iodide/perovskite area and count, which can serve as reliable indicators of crystallization efficiency and microstructural quality. These capabilities establish PerovSegNet as a scalable tool for real-time process monitoring and data-driven optimization of perovskite thin-film fabrication.The source code is available at:https://github.com/wlyyj/PerovSegNet/tree/master.
中文摘要:本研究提出了PerovSegNet深度学习框架,通过改进YOLOv8x架构实现了对钙钛矿太阳能电池SEM图像中碘化铅、钙钛矿和缺陷区域的自动精确分割,为薄膜制备过程的实时监控和优化提供了高效工具。
English Summary: This study introduces PerovSegNet, an automated deep learning framework based on enhanced YOLOv8x architecture that achieves precise segmentation of lead iodide, perovskite, and defect domains in SEM images, significantly improving analysis efficiency and accuracy for perovskite solar cell fabrication.

Authors:Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen
Title: Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
Abstract:
We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
中文: VERA是一个专为语音交互设计的实时推理评估基准,揭示了文本与语音系统间的显著性能差距,并指出在保持低延迟和高准确率方面存在的架构挑战。
English: VERA is a voice-native benchmark for evaluating real-time reasoning in conversational AI, revealing significant performance gaps between text and voice systems and highlighting architectural limitations in achieving both low latency and high accuracy.

Authors:Yida Wang, Ke Hong, Xiuhong Li, Yuanchao Xu, Wenxun Wang, Guohao Dai, Yu Wang
Title: TASP: Topology-aware Sequence Parallelism
Abstract:
Long-context large language models (LLMs) face constraints due to the quadratic complexity of the self-attention mechanism. The mainstream sequence parallelism (SP) method, Ring Attention, attempts to solve this by distributing the query into multiple query chunks across accelerators and enable each Q tensor to access all KV tensors from other accelerators via the Ring AllGather communication primitive. However, it exhibits low communication efficiency, restricting its practical applicability. This inefficiency stems from the mismatch between the Ring AllGather communication primitive it adopts and the AlltoAll topology of modern accelerators. A Ring AllGather primitive is composed of iterations of ring-styled data transfer, which can only utilize a very limited fraction of an AlltoAll topology. Inspired by the Hamiltonian decomposition of complete directed graphs, we identify that modern accelerator topology can be decomposed into multiple orthogonal ring datapaths which can concurrently transfer data without interference. Based on this, we further observe that the Ring AllGather primitive can also be decomposed into the same number of concurrent ring-styled data transfer at every iteration. Based on these insights, we propose TASP, a topology-aware SP method for long-context LLMs that fully utilizes the communication capacity of modern accelerators via topology decomposition and primitive decomposition. Experimental results on both single-node and multi-node NVIDIA H100 systems and a single-node AMD MI300X system demonstrate that TASP achieves higher communication efficiency than Ring Attention on these modern accelerator topologies and achieves up to 3.58 speedup than Ring Attention and its variant Zigzag-Ring Attention. The code is available at https://github.com/infinigence/HamiltonAttention.
中文: 针对长上下文大语言模型中序列并行方法的通信效率低下问题,TASP通过拓扑分解和通信原语分解,充分利用现代加速器的通信能力,在多种硬件系统上实现了比现有方法更高的效率和显著加速。
English: Long-context LLMs are hindered by inefficient communication in existing sequence parallelism methods, prompting the development of TASP, a topology-aware approach that decomposes both modern accelerator topologies and communication primitives to achieve significantly higher efficiency and speedup over current methods.

Authors:Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen
Title: OceanGym: A Benchmark Environment for Underwater Embodied Agents
Abstract:
We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility, dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth's last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
中文: OceanGym是首个面向水下具身智能体的综合基准,通过多模态大语言模型框架整合感知与决策,应对低能见度和洋流等极端挑战,旨在推动AI在真实海洋环境中达到人类专家水平,为探索地球最后边疆奠定基础。
English: OceanGym is the first comprehensive benchmark for underwater embodied AI agents, featuring realistic tasks and a unified MLLM-driven framework to tackle extreme challenges like low visibility and dynamic currents, aiming to bridge the gap between current AI and human expertise for real-world ocean exploration.

Authors:Seohyun Lee, Wenzhi Fang, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton
Title: TAP: Two-Stage Adaptive Personalization of Multi-task and Multi-Modal Foundation Models in Federated Learning
Abstract:
Federated Learning (FL), despite demonstrating impressive capabilities in the training of multiple models in a decentralized manner, has been shown to produce a final model not necessarily well-suited to the needs of each client. While extensive work has been conducted on how to create tailored personalized models, called Personalized Federated Learning (PFL), less attention has been given to personalization via fine-tuning of foundation models with multi-task and multi-modal properties. Moreover, there exists a lack of understanding in the literature on how to fine-tune and personalize such models in a setting that is heterogeneous across clients not only in data, but also in tasks and modalities. To address this gap in the literature, we propose TAP (Two-Stage Adaptive Personalization), which (i) leverages mismatched model architectures between the clients and server to selectively conduct replacement operations when it benefits a client's local tasks and (ii) engages in post-FL knowledge distillation for capturing beneficial general knowledge without compromising personalization. We also introduce the first convergence analysis of the server model under its modality-task pair architecture, and demonstrate that as the number of modality-task pairs increases, its ability to cater to all tasks suffers. Through extensive experiments, we demonstrate the effectiveness of our proposed algorithm across a variety of datasets and tasks in comparison to a multitude of baselines. Implementation code is publicly available at https://github.com/lee3296/TAP.
Chinese: 联邦学习常无法满足各客户端的个性化需求,因此提出的TAP方法通过架构错配和训练后蒸馏,在不损害通用知识的前提下实现了更优的个性化适配。
English: Federated Learning often fails to create models tailored to individual clients, so the proposed TAP method uses mismatched architectures and post-training distillation to enhance personalization without sacrificing general knowledge.

Authors:Adrian Kosowski, Przemysław Uznański, Jan Chorowski, Zuzanna Stamirowska, Michał Bartoszkiewicz
Title: The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Abstract:
The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce `Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of \$n\$ locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference entirely relies on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism which human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.
中文摘要:"龙雏"(BDH)模型提出了一种受生物启发的无标度神经架构,通过突触可塑性和模块化网络设计,在保持Transformer级别性能的同时实现了固有的可解释性与生物合理性。
English Summary: The "Dragon Hatchling" (BDH) model introduces a biologically inspired, scale-free neural architecture that rivals Transformer performance while offering inherent interpretability and biological plausibility through synaptic plasticity and modular network design.

Authors:Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang
Title: dParallel: Learnable Parallel Decoding for dLLMs
Abstract:
Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet, their parallel decoding potential remains largely underexplored, as existing open-source models still require nearly token-length decoding steps to ensure performance. To address this, we introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling. We identify that the key bottleneck to parallel decoding arises from the sequential certainty convergence for masked tokens. Building on this insight, we introduce the core of our approach: certainty-forcing distillation, a novel training strategy that distills the model to follow its original sampling trajectories while enforcing it to achieve high certainty on masked tokens more rapidly and in parallel. Extensive experiments across various benchmarks demonstrate that our method can dramatically reduce the number of decoding steps while maintaining performance. When applied to the LLaDA-8B-Instruct model, dParallel reduces decoding steps from 256 to 30 on GSM8K, achieving an 8.5x speedup without performance degradation. On the MBPP benchmark, it cuts decoding steps from 256 to 24, resulting in a 10.5x speedup while maintaining accuracy. Our code is available at https://github.com/czg1225/dParallel
Chinese: dParallel方法通过确定性强制蒸馏技术,释放了扩散大语言模型的并行解码潜力,将解码步骤从256步大幅减少至最低24步,在GSM8K和MBPP等基准测试中保持性能的同时实现最高10.5倍加速。
English: The dParallel method enhances diffusion large language models by enabling faster parallel decoding through certainty-forcing distillation, significantly reducing steps from 256 to as few as 24 while maintaining performance across benchmarks like GSM8K and MBPP.

Authors:Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro, Haydar Talib
Title: On Deepfake Voice Detection -- It's All in the Presentation
Abstract:
While the technologies empowering malicious audio deepfakes have dramatically evolved in recent years due to generative AI advances, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies led to systems that failed to generalize to real world application. The main reason is due to the difference between raw deepfake audio, and deepfake audio that has been presented through a communication channel, e.g. by phone. We propose a new framework for data creation and research methodology, allowing for the development of spoofing countermeasures that would be more effective in real-world scenarios. By following the guidelines outlined here we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real-world benchmark. We also demonstrate how improvement in datasets would have a bigger impact on deepfake detection accuracy than the choice of larger SOTA models would over smaller models; that is, it would be more important for the scientific community to make greater investment on comprehensive data collection programs than to simply train larger models with higher computational demands.
中文: 本文指出当前音频深度伪造检测系统因数据集和方法不足而在实际应用中失效,提出了新框架将检测准确率最高提升57%,并强调优化数据收集比训练更大模型更为重要。
English: This paper reveals that current audio deepfake detection systems fail in real-world applications due to inadequate datasets and methodologies, proposing a new framework that improved detection accuracy by up to 57% and emphasizing better data collection over larger models.

Authors:Alessio Masano, Matteo Pennisi, Federica Proietto Salanitri, Concetto Spampinato, Giovanni Bellitto
Title: Zero-Shot Decentralized Federated Learning
Abstract:
CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning. While prompting techniques like CoOp and CoCoOp enhance CLIP's adaptability, their effectiveness in Federated Learning (FL) remains an open challenge. Existing federated prompt learning approaches, such as FedCoOp and FedTPG, improve performance but face generalization issues, high communication costs, and reliance on a central server, limiting scalability and privacy. We propose Zero-shot Decentralized Federated Learning (ZeroDFL), a fully decentralized framework that enables zero-shot adaptation across distributed clients without a central coordinator. ZeroDFL employs an iterative prompt-sharing mechanism, allowing clients to optimize and exchange textual prompts to enhance generalization while drastically reducing communication overhead. We validate ZeroDFL on nine diverse image classification datasets, demonstrating that it consistently outperforms--or remains on par with--state-of-the-art federated prompt learning methods. More importantly, ZeroDFL achieves this performance in a fully decentralized setting while reducing communication overhead by 118x compared to FedTPG. These results highlight that our approach not only enhances generalization in federated zero-shot learning but also improves scalability, efficiency, and privacy preservation--paving the way for decentralized adaptation of large vision-language models in real-world applications.
中文: ZeroDFL提出了一种完全去中心化的联邦学习框架,通过迭代式提示共享实现零样本自适应,在显著降低通信成本118倍的同时超越现有方法,并提升了可扩展性与隐私保护能力。
English: ZeroDFL introduces a fully decentralized federated learning framework that enables zero-shot adaptation through iterative prompt sharing, significantly outperforming existing methods while reducing communication costs by 118x and enhancing scalability and privacy.

Authors:Artur Barros, Carlos Caetano, João Macedo, Jefersson A. dos Santos, Sandra Avila
Title: Attention over Scene Graphs: Indoor Scene Representations Toward CSAI Classification
Abstract:
Indoor scene classification is a critical task in computer vision, with wide-ranging applications that go from robotics to sensitive content analysis, such as child sexual abuse imagery (CSAI) classification. The problem is particularly challenging due to the intricate relationships between objects and complex spatial layouts. In this work, we propose the Attention over Scene Graphs for Sensitive Content Analysis (ASGRA), a novel framework that operates on structured graph representations instead of raw pixels. By first converting images into Scene Graphs and then employing a Graph Attention Network for inference, ASGRA directly models the interactions between a scene's components. This approach offers two key benefits: (i) inherent explainability via object and relationship identification, and (ii) privacy preservation, enabling model training without direct access to sensitive images. On Places8, we achieve 81.27% balanced accuracy, surpassing image-based methods. Real-world CSAI evaluation with law enforcement yields 74.27% balanced accuracy. Our results establish structured scene representations as a robust paradigm for indoor scene classification and CSAI classification. Code is publicly available at https://github.com/tutuzeraa/ASGRA.
中文摘要:ASGRA框架通过场景图和图注意力网络进行室内场景分类与敏感内容分析,在提高准确率的同时兼具可解释性和隐私保护能力。
English Summary: The ASGRA framework uses scene graphs and graph attention networks to improve indoor scene classification and sensitive content analysis, achieving higher accuracy with inherent explainability and privacy protection.

Authors:Benno Kaech, Luis Wyss, Karsten Borgwardt, Gianvito Grasso
Title: Refine Drugs, Don't Complete Them: Uniform-Source Discrete Flows for Fragment-Based Drug Discovery
Abstract:
We introduce InVirtuoGen, a discrete flow generative model for fragmented SMILES for de novo and fragment-constrained generation, and target-property/lead optimization of small molecules. The model learns to transform a uniform source over all possible tokens into the data distribution. Unlike masked models, its training loss accounts for predictions on all sequence positions at every denoising step, shifting the generation paradigm from completion to refinement, and decoupling the number of sampling steps from the sequence length. For \textit{de novo} generation, InVirtuoGen achieves a stronger quality-diversity pareto frontier than prior fragment-based models and competitive performance on fragment-constrained tasks. For property and lead optimization, we propose a hybrid scheme that combines a genetic algorithm with a Proximal Property Optimization fine-tuning strategy adapted to discrete flows. Our approach sets a new state-of-the-art on the Practical Molecular Optimization benchmark, measured by top-10 AUC across tasks, and yields higher docking scores in lead optimization than previous baselines. InVirtuoGen thus establishes a versatile generative foundation for drug discovery, from early hit finding to multi-objective lead optimization. We further contribute to open science by releasing pretrained checkpoints and code, making our results fully reproducible\footnote{https://github.com/invirtuolabs/InVirtuoGen_results}.
Chinese: InVirtuoGen是一种用于片段化SMILES的离散流生成模型,在从头生成、片段约束的小分子设计以及靶向性质和先导化合物优化方面表现卓越,在分子优化基准测试中创下最新性能记录,为药物发现提供了多功能基础。
English: InVirtuoGen is a discrete flow generative model for fragmented SMILES that excels in de novo and fragment-constrained small molecule generation, as well as target-property and lead optimization, setting new state-of-the-art performance in molecular optimization benchmarks and providing a versatile foundation for drug discovery.

Authors:Zhiwei Yang, Chen Gao, Mike Zheng Shou
Title: PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
Abstract:
Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applying to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handle any scene and any anomaly types without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
中文: 本文提出PANDA这一通用视频异常检测系统,它通过自适应策略规划、启发式推理、工具增强反思和记忆链自我改进,无需训练数据或人工干预即可自主处理各种场景和异常类型。
English: This paper introduces PANDA, a generalist video anomaly detection system that autonomously handles diverse scenarios and anomaly types without training data or human intervention through adaptive strategy planning, heuristic reasoning, tool-augmented reflection, and memory-driven self-improvement.

Authors:Jinyeop Song, Song Wang, Julian Shun, Yada Zhu
Title: Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
Abstract:
Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1.
中文: KG-R1通过强化学习框架,采用单一智能体实现知识图谱检索增强生成,在提升推理效率的同时具备跨知识图谱的强迁移能力。
English: KG-R1 introduces a reinforcement learning-based framework that enhances knowledge-graph retrieval-augmented generation by using a single agent for efficient reasoning and transferable performance across different knowledge graphs.

Authors:Junjie Zhou, Ze Liu, Lei Xiong, Jin-Ge Yao, Yueze Wang, Shitao Xiao, Fenfen Lin, Miguel Hu Chen, Zhicheng Dou, Siqi Bao, Defu Lian, Yongping Xiong, Zheng Liu
Title: MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
Abstract:
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic correspondence (e.g., object-text matching) while failing to assess the deeper reasoning required to capture complex relationships between visual and textual information. To address this gap, we introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval. MR$^2$-Bench presents the following critical values: 1) all tasks are reasoning-driven, going beyond shallow matching to effectively assess models' capacity for logical, spatial, and causal inference; 2) it features diverse multimodal data, such as natural images, diagrams, and visual puzzles, enabling comprehensive evaluation across content types; 3) it supports complex queries and documents containing multiple images and covers diverse retrieval scenarios, more accurately reflecting real-world applications. Our benchmark contains 1,309 curated queries, derived either from manual collection and annotation or from selective consolidation of public datasets. Despite achieving strong results on existing benchmarks, current state-of-the-art models still struggle on MR$^2$-Bench: for example, the leading Seed1.6-Embedding model attains a Recall@1 of 77.78 on MMEB, but only 9.91 on MR$^2$-Bench. This substantial performance gap highlights both the increased challenge posed by our benchmark and the pressing need for further advances in reasoning-intensive multimodal retrieval. The dataset and evaluation code will be made publicly available at https://github.com/VectorSpaceLab/MR2-Bench.
中文: MR$^2$-Bench 是一个推理密集型多模态检索基准,旨在评估超越浅层匹配的逻辑、空间和因果推理能力,揭示了现有先进模型的显著性能差距,并强调了该领域进一步发展的必要性。
English: MR$^2$-Bench is introduced as a reasoning-intensive multimodal retrieval benchmark that assesses deeper logical, spatial, and causal inference beyond surface-level matching, revealing significant performance gaps in state-of-the-art models and highlighting the need for advancements in this area.

Authors:Harold Haodong Chen, Xianfeng Wu, Wen-Jie Shu, Rongjin Guo, Disen Lan, Harry Yang, Ying-Cong Chen
Title: Go with Your Gut: Scaling Confidence for Autoregressive Image Generation
Abstract:
Test-time scaling (TTS) has demonstrated remarkable success in enhancing large language models, yet its application to next-token prediction (NTP) autoregressive (AR) image generation remains largely uncharted. Existing TTS approaches for visual AR (VAR), which rely on frequent partial decoding and external reward models, are ill-suited for NTP-based image generation due to the inherent incompleteness of intermediate decoding results. To bridge this gap, we introduce ScalingAR, the first TTS framework specifically designed for NTP-based AR image generation that eliminates the need for early decoding or auxiliary rewards. ScalingAR leverages token entropy as a novel signal in visual token generation and operates at two complementary scaling levels: (i) Profile Level, which streams a calibrated confidence state by fusing intrinsic and conditional signals; and (ii) Policy Level, which utilizes this state to adaptively terminate low-confidence trajectories and dynamically schedule guidance for phase-appropriate conditioning strength. Experiments on both general and compositional benchmarks show that ScalingAR (1) improves base models by 12.5% on GenEval and 15.2% on TIIF-Bench, (2) efficiently reduces visual token consumption by 62.0% while outperforming baselines, and (3) successfully enhances robustness, mitigating performance drops by 26.0% in challenging scenarios.
中文:ScalingAR首次为基于下一令牌预测的自回归图像生成设计了测试时缩放框架,利用令牌熵实现自适应轨迹终止和动态引导调度,显著提升了生成性能与效率。
English: ScalingAR introduces the first test-time scaling framework for next-token prediction autoregressive image generation, using token entropy to enable adaptive trajectory termination and dynamic guidance scheduling, achieving significant performance gains and efficiency improvements.

Authors:Kirill Tamogashev, Nikolay Malkin
Title: Data-to-Energy Stochastic Dynamics
Abstract:
The Schrödinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost. This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences. However, all existing algorithms allow to infer such dynamics only for cases where samples from both distributions are available. In this paper, we propose the first general method for modelling Schrödinger bridges when one (or both) distributions are given by their unnormalised densities, with no access to data samples. Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers. We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions. As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method. Code: https://github.com/mmacosha/d2e-stochastic-dynamics
中文: 本文提出了首个通用方法,能在无数据样本情况下通过未归一化密度构建薛定谔桥,采用创新的数据到能量迭代比例拟合算法,成功实现了多模态分布间的传输并优化了扩散动力学。
English: This paper introduces the first general method for modeling Schrödinger bridges when distributions are specified by unnormalized densities without data samples, using a novel data-to-energy iterative proportional fitting approach that successfully handles multimodal transports and improves diffusion dynamics.

Authors:Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao
Title: Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
Abstract:
Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.
中文: 本研究提出“误进化”概念,指出基于大语言模型的自进化智能体在进化过程中可能偏离预期方向,导致安全性退化、工具漏洞等普遍风险,亟需建立新的安全范式。
English: This study introduces the concept of "misevolution," where self-evolving agents based on large language models deviate in unintended ways, leading to widespread risks such as safety degradation and vulnerabilities across evolutionary pathways, highlighting the need for new safety paradigms.

Authors:Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, Chengwei Qin
Title: Interactive Learning for LLM Reasoning
Abstract:
Existing multi-agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). However, during inference, they require re-executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi-agent interaction can enhance LLMs' independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM's reward distribution characteristics into another's reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs across two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types can boost multi-agent learning compared to pure cooperative or competitive strategies.
中文摘要:ILR框架通过动态多智能体交互和感知校准增强了大语言模型的独立解题能力,在多项测试中相比单智能体学习提升达5%,并显著增强了系统鲁棒性。
English Summary: The ILR framework enhances LLM problem-solving through dynamic multi-agent interactions and perception calibration, achieving up to 5% performance gains over single-agent systems while improving robustness.

Authors:Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Robert Mullins
Title: Feedback Forensics: A Toolkit to Measure AI Personality
Abstract:
Some traits making a "good" AI model are hard to describe upfront. For example, should responses be more polite or more casual? Such traits are sometimes summarized as model character or personality. Without a clear objective, conventional benchmarks based on automatic validation struggle to measure such traits. Evaluation methods using human feedback such as Chatbot Arena have emerged as a popular alternative. These methods infer "better" personality and other desirable traits implicitly by ranking multiple model responses relative to each other. Recent issues with model releases highlight limitations of these existing opaque evaluation approaches: a major model was rolled back over sycophantic personality issues, models were observed overfitting to such feedback-based leaderboards. Despite these known issues, limited public tooling exists to explicitly evaluate model personality. We introduce Feedback Forensics: an open-source toolkit to track AI personality changes, both those encouraged by human (or AI) feedback, and those exhibited across AI models trained and evaluated on such feedback. Leveraging AI annotators, our toolkit enables investigating personality via Python API and browser app. We demonstrate the toolkit's usefulness in two steps: (A) first we analyse the personality traits encouraged in popular human feedback datasets including Chatbot Arena, MultiPref and PRISM; and (B) then use our toolkit to analyse how much popular models exhibit such traits. We release (1) our Feedback Forensics toolkit alongside (2) a web app tracking AI personality in popular models and feedback datasets as well as (3) the underlying annotation data at https://github.com/rdnfn/feedback-forensics.
中文: 摘要介绍了Feedback Forensics,一个开源工具包,旨在显式评估和追踪AI模型的个性特征,通过分析人类反馈数据集中鼓励的特征及模型表现出的特征,以解决当前不透明评估方法的局限性。
English: The abstract introduces Feedback Forensics, an open-source toolkit designed to explicitly evaluate and track AI model personality traits, addressing limitations in current opaque evaluation methods by analyzing traits encouraged in human feedback datasets and exhibited in models.

Authors:Suli Wang, Yangshen Deng, Zhenghua Bao, Xinyu Zhan, Yiqun Duan
Title: NeuroTTT: Bridging Pretraining-Downstream Task Misalignment in EEG Foundation Models via Test-Time Training
Abstract:
Large-scale foundation models for EEG signals offer a promising path to generalizable brain-computer interface (BCI) applications, but they often suffer from misalignment between pretraining objectives and downstream tasks, as well as significant cross-subject distribution shifts. This paper addresses these challenges by introducing a two-stage alignment strategy that bridges the gap between generic pretraining and specific EEG decoding tasks. First, we propose NeuroTTT: a domain-specific self-supervised fine-tuning paradigm that augments the foundation model with task-relevant self-supervised objectives, aligning latent representations to important spectral, spatial, and temporal EEG features without requiring additional labeled data. Second, we incorporate test-time training (TTT) at inference, we perform (i) self-supervised test-time training on individual unlabeled test samples and (ii) prediction entropy minimization (Tent), which updates only normalization statistics to continually calibrate the model to each new input on the fly. Our approach, which, to our knowledge, is the first to unify domain-tuned self-supervision with test-time training in large-scale EEG foundation models, yields substantially improved robustness and accuracy across diverse BCI tasks (imagined speech, stress detection, motor imagery). Using CBraMod and LaBraM as backbones, our method pushes their performance to a markedly higher level. Results on three diverse tasks demonstrate that the proposed alignment strategy achieves state-of-the-art performance, outperforming conventional fine-tuning and adaptation methods. Our code is available at https://github.com/wsl2000/NeuroTTT.
中文: 本文提出一种两阶段对齐策略,通过领域特定的自监督微调和测试时训练相结合,解决了脑电基础模型中预训练目标与下游任务不匹配及跨被试分布差异的问题,显著提升了多种脑机接口任务的性能。
English: This paper introduces a two-stage alignment strategy combining domain-specific self-supervised fine-tuning and test-time training to enhance EEG foundation models' performance across various BCI tasks by addressing pretraining-task misalignment and cross-subject distribution shifts.

Authors:Lionel Blondé, Joao A. Candido Ramos, Alexandros Kalousis
Title: Noise-Guided Transport for Imitation Learning
Abstract:
We consider imitation learning in the low-data regime, where only a limited number of expert demonstrations are available. In this setting, methods that rely on large-scale pretraining or high-capacity architectures can be difficult to apply, and efficiency with respect to demonstration data becomes critical. We introduce Noise-Guided Transport (NGT), a lightweight off-policy method that casts imitation as an optimal transport problem solved via adversarial training. NGT requires no pretraining or specialized architectures, incorporates uncertainty estimation by design, and is easy to implement and tune. Despite its simplicity, NGT achieves strong performance on challenging continuous control tasks, including high-dimensional Humanoid tasks, under ultra-low data regimes with as few as 20 transitions. Code is publicly available at: https://github.com/lionelblonde/ngt-pytorch.
Chinese: 本文提出噪声引导传输(NGT)方法,将模仿学习构建为最优传输问题,无需预训练或特殊架构,仅用20条专家轨迹就能在复杂任务上实现优异性能。
English: This paper introduces Noise-Guided Transport (NGT), a lightweight imitation learning method that frames imitation as an optimal transport problem and achieves strong performance on challenging tasks with as few as 20 expert transitions, requiring no pretraining or specialized architectures.

Authors:Anthony Zhou, Alexander Wikner, Amaury Lancelin, Pedram Hassanzadeh, Amir Barati Farimani
Title: Reframing Generative Models for Physical Systems using Stochastic Interpolants
Abstract:
Generative models have recently emerged as powerful surrogates for physical systems, demonstrating increased accuracy, stability, and/or statistical fidelity. Most approaches rely on iteratively denoising a Gaussian, a choice that may not be the most effective for autoregressive prediction tasks in PDEs and dynamical systems such as climate. In this work, we benchmark generative models across diverse physical domains and tasks, and highlight the role of stochastic interpolants. By directly learning a stochastic process between current and future states, stochastic interpolants can leverage the proximity of successive physical distributions. This allows for generative models that can use fewer sampling steps and produce more accurate predictions than models relying on transporting Gaussian noise. Our experiments suggest that generative models need to balance deterministic accuracy, spectral consistency, and probabilistic calibration, and that stochastic interpolants can potentially fulfill these requirements by adjusting their sampling. This study establishes stochastic interpolants as a competitive baseline for physical emulation and gives insight into the abilities of different generative modeling frameworks.
中文摘要:本研究在物理系统中对生成模型进行了基准测试,强调随机插值法通过直接学习状态间的转换,实现了更准确高效的预测,优于传统的去噪方法。
English Summary: This study benchmarks generative models in physical systems, highlighting that stochastic interpolants enable more accurate and efficient predictions by directly learning transitions between states, outperforming traditional denoising approaches.

Authors:James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
Title: Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
Abstract:
Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our code is available at http://github.com/james-oldfield/tpc.
Chinese: 截断多项式分类器(TPCs)通过渐进式评估提供了一种灵活高效的LLM激活监控方法,允许对简单输入提前终止检查,对模糊输入进行深度分析,从而在确保安全性的同时优化计算资源。
English: Truncated Polynomial Classifiers (TPCs) offer a flexible and efficient approach to monitoring LLM activations for harmful requests by enabling progressive evaluation, allowing early stopping for simple cases or deeper analysis for ambiguous inputs, thereby optimizing computational resources while maintaining safety.

Authors:Jiayi Guo, Chuanhao Yan, Xingqian Xu, Yulin Wang, Kai Wang, Gao Huang, Humphrey Shi
Title: IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Abstract:
Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works finetune diffusion weight using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
中文: 提出的隐式多模态引导框架通过多模态模型识别图像与提示的不匹配,并操控扩散特征进行重新生成,无需额外数据或编辑操作即可优于现有对齐方法。
English: The proposed Implicit Multimodal Guidance (IMG) framework improves alignment between generated images and prompts by identifying misalignments with a multimodal model and manipulating diffusion features for regeneration, outperforming existing methods without requiring additional data or editing operations.

Authors:Haiyang Zheng, Nan Pu, Wenjing Li, Nicu Sebe, Zhun Zhong
Title: Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts
Abstract:
Generalized Category Discovery (GCD) is an open-world problem that clusters unlabeled data by leveraging knowledge from partially labeled categories. A key challenge is that unlabeled data may contain both known and novel categories. Existing approaches suffer from two main limitations. First, they fail to exploit multi-granularity conceptual information in visual data, which limits representation quality. Second, most assume that the number of unlabeled categories is known during training, which is impractical in real-world scenarios. To address these issues, we propose a Multi-Granularity Conceptual Experts (MGCE) framework that adaptively mines visual concepts and integrates multi-granularity knowledge for accurate category discovery. MGCE consists of two modules: (1) Dynamic Conceptual Contrastive Learning (DCCL), which alternates between concept mining and dual-level representation learning to jointly optimize feature learning and category discovery; and (2) Multi-Granularity Experts Collaborative Learning (MECL), which extends the single-expert paradigm by introducing additional experts at different granularities and by employing a concept alignment matrix for effective cross-expert collaboration. Importantly, MGCE can automatically estimate the number of categories in unlabeled data, making it suitable for practical open-world settings. Extensive experiments on nine fine-grained visual recognition benchmarks demonstrate that MGCE achieves state-of-the-art results, particularly in novel-class accuracy. Notably, even without prior knowledge of category numbers, MGCE outperforms parametric approaches that require knowing the exact number of categories, with an average improvement of 3.6\%. Code is available at https://github.com/HaiyangZheng/MGCE.
中文: 提出的多粒度概念专家(MGCE)框架通过自适应挖掘视觉概念并整合多粒度知识,解决了广义类别发现中的关键限制,实现了自动类别数量估计并在九个基准测试中取得了最优性能。
English: The proposed Multi-Granularity Conceptual Experts (MGCE) framework addresses limitations in Generalized Category Discovery by adaptively mining visual concepts and integrating multi-granularity knowledge, enabling automatic category estimation and achieving state-of-the-art performance across nine benchmarks.

Authors:Alessandro De Bellis, Salvatore Bufi, Giovanni Servedio, Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio
Title: Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models
Abstract:
Inductive link prediction is emerging as a key paradigm for real-world knowledge graphs (KGs), where new entities frequently appear and models must generalize to them without retraining. Predicting links in a KG faces the challenge of guessing previously unseen entities by leveraging generalizable node features such as subgraph structure, type annotations, and ontological constraints. However, explicit type information is often lacking or incomplete. Even when available, type information in most KGs is often coarse-grained, sparse, and prone to errors due to human annotation. In this work, we explore the potential of pre-trained language models (PLMs) to enrich node representations with implicit type signals. We introduce TyleR, a Type-less yet type-awaRe approach for subgraph-based inductive link prediction that leverages PLMs for semantic enrichment. Experiments on standard benchmarks demonstrate that TyleR outperforms state-of-the-art baselines in scenarios with scarce type annotations and sparse graph connectivity. To ensure reproducibility, we share our code at https://github.com/sisinflab/tyler .
中文摘要:TyleR提出了一种无需显式类型标注但具备类型感知能力的方法,通过预训练语言模型增强节点表示,在类型标注稀缺和连接稀疏的场景下实现了最先进的归纳链接预测性能。
English Summary: TyleR introduces a type-aware approach using pre-trained language models to enhance node representations for inductive link prediction, achieving superior performance in scenarios with limited type annotations and sparse connectivity.

Authors:Chenyang Jiang, Zhengcen Li, Hang Zhao, Qiben Shan, Shaocong Wu, Jingyong Su
Title: Beyond Pixels: Efficient Dataset Distillation via Sparse Gaussian Representation
Abstract:
Dataset distillation has emerged as a promising paradigm that synthesizes compact, informative datasets capable of retaining the knowledge of large-scale counterparts, thereby addressing the substantial computational and storage burdens of modern model training. Conventional approaches typically rely on dense pixel-level representations, which introduce redundancy and are difficult to scale up. In this work, we propose GSDD, a novel and efficient sparse representation for dataset distillation based on 2D Gaussians. Instead of representing all pixels equally, GSDD encodes critical discriminative information in a distilled image using only a small number of Gaussian primitives. This sparse representation could improve dataset diversity under the same storage budget, enhancing coverage of difficult samples and boosting distillation performance. To ensure both efficiency and scalability, we adapt CUDA-based splatting operators for parallel inference and training, enabling high-quality rendering with minimal computational and memory overhead. Our method is simple yet effective, broadly applicable to different distillation pipelines, and highly scalable. Experiments show that GSDD achieves state-of-the-art performance on CIFAR-10, CIFAR-100, and ImageNet subsets, while remaining highly efficient encoding and decoding cost. Our code is available at https://github.com/j-cyoung/GSDatasetDistillation.
中文: GSDD提出了一种基于二维高斯分布的稀疏数据集蒸馏方法,仅用少量高斯基元编码关键图像信息,在多个基准测试中以最小计算成本实现了最优性能。
English: GSDD introduces a novel sparse dataset distillation method using 2D Gaussians to encode critical image information efficiently, achieving state-of-the-art performance with minimal computational overhead across multiple benchmarks.

Authors:Zican Hu, Shilin Zhang, Yafu Li, Jianhao Yan, Xuyang Hu, Leyang Cui, Xiaoye Qu, Chunlin Chen, Yu Cheng, Zhi Wang
Title: Diversity-Incentivized Exploration for Versatile Reasoning
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a crucial paradigm for incentivizing reasoning capabilities in Large Language Models (LLMs). Due to vast state-action spaces and reward sparsity in reasoning tasks, existing methods often struggle with deficient exploration and poor sample efficiency. In the paper, we propose \textbf{DIVER} (\textbf{D}iversity-\textbf{I}ncentivized Exploration for \textbf{V}ersatil\textbf{E} \textbf{R}easoning), an innovative framework that highlights the pivotal role of global sequence-level diversity to incentivize deep exploration for versatile reasoning. We first conduct a primary empirical study to reveal a strong positive correlation between global diversity and reasoning capacity. Building on this insight, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space. Incorporating the intrinsic reward, we develop a potential-based reward shaping mechanism to preserve optimal policy invariance and design simple heuristics to mitigate possible reward hacking. Experimental results show that DIVER outperforms competitive RLVR baselines with various exploration strategies on both in-domain and out-of-domain tasks, excelling in both Pass@1 and Pass@k evaluations. Our code is available at https://github.com/NJU-RL/DIVER.
Chinese: 本文提出DIVER框架,通过激励全局序列多样性来促进深度探索,提升大型语言模型在推理任务中的强化学习效果与样本效率。
English: The paper introduces DIVER, a framework that enhances reinforcement learning for versatile reasoning in large language models by incentivizing global sequence-level diversity to promote deep exploration and improve sample efficiency.

Authors:Hatim Chergui, Miguel Catalan Cid, Pouria Sayyad Khodashenas, Daniel Camps Mur, Christos Verikoukis
Title: Toward an Unbiased Collective Memory for Efficient LLM-Based Agentic 6G Cross-Domain Management
Abstract:
This paper introduces a novel framework for proactive cross-domain resource orchestration in 6G RAN-Edge networks, featuring large language model (LLM)-augmented agents. The system comprises specialized RAN (energy efficiency) and Edge (latency assurance) agents that engage in iterative negotiation, supported by advanced reasoning and planning capabilities. Agents dynamically interact with a digital twin (DT) to test their proposals and leverage a long-term collective memory where their joint successful and failed agreements along with the related network contexts are distilled into strategies to either follow or avoid and subsequently stored. Given that agents are subject to a plethora of cognitive distortions when retrieving those past experiences -- such as primacy, recency, confirmation and availability biases -- we propose in this work a novel unbiased memory design (A reusable mockup version of the unbiased memory source code is available for non-commercial use at https://github.com/HatimChergui/unbiased-collective-memory). featuring (i) semantic retrieval of past strategies via Jaccard similarity; (ii) learning from failures through amplified weighting of SLA violations and mandatory inclusion of failed negotiation cases to mitigate confirmation bias; (iii) diversity enforcement to minimize availability bias and (iv) recency and primacy weighting with slow decay to counteract temporal biases. Evaluation results showcase the impact of existing biases and how the unbiased memory allows to tackle them by learning from both successful and failed strategies, either present or old, resulting in $\times 4.5$ and $\times 3.5$ reductions of unresolved negotiations compared to non-memory and vanilla memory baselines, respectively, while totally mitigating SLA violations as well as improving latency and energy saving distributions.
中文: 本文提出了一种6G无线接入网与边缘网络的新型框架,采用大语言模型增强的智能体通过数字孪生测试和无偏集体记忆系统进行协商,该记忆系统通过偏差缓解策略将未解决协商减少4.5倍并完全消除服务等级协议违规。
English: This paper proposes a novel framework for 6G RAN-Edge networks using LLM-augmented agents that negotiate via digital twin testing and an unbiased collective memory system, which reduces unresolved negotiations by 4.5× and eliminates SLA violations through bias-mitigating strategies.

Authors:Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, Binhang Yuan
Title: Parallax: Efficient LLM Inference Service over Decentralized Environment
Abstract:
Deploying a large language model (LLM) inference service remains costly because centralized serving depends on specialized GPU clusters and high-bandwidth interconnects in datacenters. An appealing alternative is to leverage collaborative decentralized GPU pools. However, heterogeneity in GPU and limited interconnected network bandwidth, along with potentially dynamic availability, make efficient scheduling the central challenge in this scenario. In this paper, we present Parallax, a decentralized LLM serving system that turns a pool of heterogeneous GPUs into an efficient inference platform via a two-phase scheduler. Parallax decomposes planning into (i) model allocation, which places layers of each replica across diverse GPUs to jointly optimize latency and throughput under memory and link-bandwidth constraints, and (ii) request-time GPU pipeline selection, which stitches layers from different replicas into end-to-end execution chains that balance load and adapt to current conditions. We implement Parallax and evaluate it on open-source LLMs deployed over real volunteer nodes. Parallax consistently reduces latency and increases throughput relative to decentralized baselines, demonstrating that principled scheduling can make volunteer compute a practical, affordable substrate for LLM inference. Github Repo at: https://github.com/GradientHQ/parallax.
中文:Parallax是一种去中心化大语言模型服务系统,通过两阶段调度器有效管理异构GPU资源,在降低延迟的同时提升吞吐量,实现经济高效的推理计算。
English: Parallax is a decentralized LLM serving system that uses a two-phase scheduler to efficiently manage heterogeneous GPU pools, reducing latency and increasing throughput for cost-effective inference.

Authors:Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, Qingdong He, Donghao Luo, Zhenye Gan, Junwei Zhu, Yunhang Shen, Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan
Title: Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scene, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination. The extensive experiments on 17 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
中文: 本文提出了Human-MME基准,旨在全面评估多模态大语言模型在以人为中心的场景理解能力,通过提供多样化场景、渐进式评估维度和高质量标注,弥补现有评估工具的不足,为未来研究指明方向。
English: This paper introduces Human-MME, a comprehensive benchmark designed to holistically evaluate Multimodal Large Language Models in human-centric scene understanding, addressing the lack of existing evaluation tools by providing diverse scenarios, progressive assessment dimensions, and high-quality annotations to guide future research.

Authors:Runxin Yang, Yuxuan Wan, Shuqing Li, Michael R. Lyu
Title: 90% Faster, 100% Code-Free: MLLM-Driven Zero-Code 3D Game Development
Abstract:
Developing 3D games requires specialized expertise across multiple domains, including programming, 3D modeling, and engine configuration, which limits access to millions of potential creators. Recently, researchers have begun to explore automated game development. However, existing approaches face three primary challenges: (1) limited scope to 2D content generation or isolated code snippets; (2) requirement for manual integration of generated components into game engines; and (3) poor performance on handling interactive game logic and state management. While Multimodal Large Language Models (MLLMs) demonstrate potential capabilities to ease the game generation task, a critical gap still remains in translating these outputs into production-ready, executable game projects based on game engines such as Unity and Unreal Engine. To bridge the gap, this paper introduces UniGen, the first end-to-end coordinated multi-agent framework that automates zero-coding development of runnable 3D games from natural language requirements. Specifically, UniGen uses a Planning Agent that interprets user requirements into structured blueprints and engineered logic descriptions; after which a Generation Agent produces executable C# scripts; then an Automation Agent handles engine-specific component binding and scene construction; and lastly a Debugging Agent provides real-time error correction through conversational interaction. We evaluated UniGen on three distinct game prototypes. Results demonstrate that UniGen not only democratizes game creation by requiring no coding from the user, but also reduces development time by 91.4%. We release UniGen at https://github.com/yxwan123/UniGen. A video demonstration is available at https://www.youtube.com/watch?v=xyJjFfnxUx0.
中文:本文提出UniGen框架,通过多智能体协作实现从自然语言需求到可运行3D游戏的端到端自动开发,解决了现有方法在游戏逻辑处理和引擎集成方面的不足,使非专业用户无需编程即可快速创建游戏。
English: This paper introduces UniGen, an automated multi-agent framework that enables zero-coding development of executable 3D games from natural language, overcoming current limitations in game generation by integrating planning, script generation, engine automation, and debugging.

Authors:Kyeongryeol Go
Title: Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis
Abstract:
The performance of deep neural networks is strongly influenced by the quality of their training data. However, mitigating dataset bias by manually curating challenging edge cases remains a major bottleneck. To address this, we propose an automated pipeline for text-guided edge-case synthesis. Our approach employs a Large Language Model, fine-tuned via preference learning, to rephrase image captions into diverse textual prompts that steer a Text-to-Image model toward generating difficult visual scenarios. Evaluated on the FishEye8K object detection benchmark, our method achieves superior robustness, surpassing both naive augmentation and manually engineered prompts. This work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, offering a promising direction for developing more reliable and continuously improving AI systems. Code is available at https://github.com/gokyeongryeol/ATES.
中文: 本文提出了一种自动化流程,通过微调的大型语言模型生成多样化文本提示,引导文本到图像模型合成具有挑战性的边缘案例,从而提升深度神经网络的鲁棒性,并在FishEye8K基准测试中验证了其优越性。
English: This paper introduces an automated pipeline that leverages a fine-tuned Large Language Model to generate diverse textual prompts, enabling a Text-to-Image model to synthesize challenging edge cases for improving deep neural network robustness, as validated on the FishEye8K benchmark.

Authors:Sachith Abeywickrama, Emadeldeen Eldele, Min Wu, Xiaoli Li, Chau Yuen
Title: EntroPE: Entropy-Guided Dynamic Patch Encoder for Time Series Forecasting
Abstract:
Transformer-based models have significantly advanced time series forecasting, with patch-based input strategies offering efficiency and improved long-horizon modeling. Yet, existing approaches rely on temporally-agnostic patch construction, where arbitrary starting positions and fixed lengths fracture temporal coherence by splitting natural transitions across boundaries. This naive segmentation often disrupts short-term dependencies and weakens representation learning. In response, we propose EntroPE (Entropy-Guided Dynamic Patch Encoder), a novel, temporally informed framework that dynamically detects transition points via conditional entropy and dynamically places patch boundaries. This preserves temporal structure while retaining the computational benefits of patching. EntroPE consists of two key modules, namely an Entropy-based Dynamic Patcher (EDP) that applies information-theoretic criteria to locate natural temporal shifts and determine patch boundaries, and an Adaptive Patch Encoder (APE) that employs pooling and cross-attention to capture intra-patch dependencies and produce fixed-size latent representations. These embeddings are then processed by a global transformer to model inter-patch dynamics. Experiments across long-term forecasting benchmarks demonstrate that EntroPE improves both accuracy and efficiency, establishing entropy-guided dynamic patching as a promising new paradigm for time series modeling. Code is available at: https://github.com/Sachithx/EntroPE.
中文摘要:提出的EntroPE框架通过熵引导的动态分块技术保持时间序列的时序连贯性,在保留计算效率的同时克服了固定分块方法的局限性。
English Summary: The proposed EntroPE framework introduces entropy-guided dynamic patching to preserve temporal coherence in time series forecasting, overcoming limitations of fixed patch segmentation while maintaining computational efficiency.

Authors:Asmita Sengupta, David Antony Selby, Sebastian Josef Vollmer, Gerrit Großmann
Title: MEDAKA: Construction of Biomedical Knowledge Graphs Using Large Language Models
Abstract:
Knowledge graphs (KGs) are increasingly used to represent biomedical information in structured, interpretable formats. However, existing biomedical KGs often focus narrowly on molecular interactions or adverse events, overlooking the rich data found in drug leaflets. In this work, we present (1) a hackable, end-to-end pipeline to create KGs from unstructured online content using a web scraper and an LLM; and (2) a curated dataset, MEDAKA, generated by applying this method to publicly available drug leaflets. The dataset captures clinically relevant attributes such as side effects, warnings, contraindications, ingredients, dosage guidelines, storage instructions and physical characteristics. We evaluate it through manual inspection and with an LLM-as-a-Judge framework, and compare its coverage with existing biomedical KGs and databases. We expect MEDAKA to support tasks such as patient safety monitoring and drug recommendation. The pipeline can also be used for constructing KGs from unstructured texts in other domains. Code and dataset are available at https://github.com/medakakg/medaka.
中文:本研究提出了一种从非结构化在线内容构建知识图谱的可扩展流程,特别通过处理药品说明书生成了MEDAKA数据集,该数据集涵盖全面的临床属性,并经过人工与自动评估验证,旨在支持生物医学应用。
English: This work introduces a flexible pipeline for generating knowledge graphs from unstructured online content, specifically creating the MEDAKA dataset from drug leaflets to capture comprehensive clinical attributes, which is validated through manual and automated evaluation to support biomedical applications.

Authors:Julian Valdez, Ignacio Torroba, John Folkesson, Ivan Stenius
Title: Side Scan Sonar-based SLAM for Autonomous Algae Farm Monitoring
Abstract:
The transition of seaweed farming to an alternative food source on an industrial scale relies on automating its processes through smart farming, equivalent to land agriculture. Key to this process are autonomous underwater vehicles (AUVs) via their capacity to automate crop and structural inspections. However, the current bottleneck for their deployment is ensuring safe navigation within farms, which requires an accurate, online estimate of the AUV pose and map of the infrastructure. To enable this, we propose an efficient side scan sonar-based (SSS) simultaneous localization and mapping (SLAM) framework that exploits the geometry of kelp farms via modeling structural ropes in the back-end as sequences of individual landmarks from each SSS ping detection, instead of combining detections into elongated representations. Our method outperforms state of the art solutions in hardware in the loop (HIL) experiments on a real AUV survey in a kelp farm. The framework and dataset can be found at https://github.com/julRusVal/sss_farm_slam.
中文摘要:海藻养殖向工业化转型需借助自主水下航行器实现自动化,但其在养殖场内的安全导航面临挑战;本研究提出一种高效的侧扫声纳同步定位与建图框架,将结构绳索建模为连续地标序列,在实际测试中优于现有解决方案。
English Summary: The transition to industrial-scale seaweed farming requires automation through autonomous underwater vehicles (AUVs), which face navigation challenges in kelp farms; this study proposes an efficient side scan sonar-based SLAM framework that models structural ropes as sequences of landmarks, outperforming existing methods in real-world experiments.

Authors:Shuai Wang, Liang Ding, Li Shen, Yong Luo, Han Hu, Lefei Zhang, Fu Lin
Title: A Multi-Language Object-Oriented Programming Benchmark for Large Language Models
Abstract:
Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming language; 94.3% target only function-level or statement-level tasks; and over 80% include fewer than ten test cases on average. To address these gaps, we propose MultiOOP, a multi-language object-oriented programming benchmark covering six popular languages (Python, PHP, C++, C#, Java, JavaScript) with 267 tasks per language. We design a translator that extends an existing single-language OOP benchmark and the pass@o metric to a multilingual setting. Moreover, we propose an automated framework for augmenting test cases to ensure the reliability of the evaluation results. We evaluate 14 mainstream LLMs under zero-shot prompting and report three key findings: 1) Substantial performance degradation: pass@1 scores on MultiOOP drop by up to 65.6 percentage points compared to function-level tasks (e.g., HumanEval). 2) Cross-language variability: GPT-4o mini achieves pass@1 of 48.06% in Python but only 0.12%-15.26% in other languages, indicating limited multilingual generalization. 3) Conceptual gaps: pass@o scores are consistently 1.1-19.2 points lower than pass@k, demonstrating that LLMs often generate executable code without fully capturing core OOP concepts. Our benchmark, metric extensions, and evaluation scripts will be publicly released to foster a more balanced and comprehensive assessment of LLMs in object-oriented code generation. Our code and data will be released at https://github.com/alphadl/OOP-eval and https://huggingface.co/datasets/codeai-dteam/MultiOOP respectively.
中文:本文提出MultiOOP多语言面向对象编程基准,通过覆盖六种编程语言和大量任务解决现有代码生成评估的不平衡问题,发现大型语言模型存在显著性能下降和有限的多语言泛化能力。
English: This paper introduces MultiOOP, a multilingual object-oriented programming benchmark addressing imbalances in existing code generation evaluations by covering six languages and extensive tasks, revealing significant performance drops and limited generalization in LLMs.

Authors:Shigui Li, Wei Chen, Delu Zeng
Title: EVODiff: Entropy-aware Variance Optimized Diffusion Inference
Abstract:
Diffusion models (DMs) excel in image generation, but suffer from slow inference and the training-inference discrepancies. Although gradient-based solvers like DPM-Solver accelerate the denoising inference, they lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5\% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25\% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.
Chinese: 本文提出EVODiff方法,通过优化条件熵来降低去噪过程中的不确定性,显著提升了扩散模型的生成效率与质量,在多项图像生成任务中优于现有梯度求解器。
English: This paper introduces EVODiff, an entropy-aware variance optimization method that enhances diffusion models by reducing conditional entropy during denoising, achieving superior performance over existing solvers in image generation tasks.

Authors:Leitian Tao, Xuefeng Du, Yixuan Li
Title: Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis
Abstract:
Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing textual data synthesis methods are computationally expensive. We propose a novel framework LENS for synthesizing preference data directly in the LLM's latent embedding space. Our method employs a Variational Autoencoder (VAE) to learn a structured latent representation of response embeddings. By performing controlled perturbations in this latent space and decoding back to the embedding space, we efficiently generate diverse, semantically consistent synthetic preference pairs, bypassing costly text generation and annotation. We provide theoretical guarantees that our synthesized pairs approximately preserve original preference ordering and improve reward model generalization. Empirically, our latent-space synthesis significantly outperforms text-based augmentation on standard benchmarks, achieving superior results while being 18x faster in generation and using a 16,000x smaller model. Our work offers a scalable and effective alternative for enhancing reward modeling through efficient data augmentation. Code is publicly available at https://github.com/deeplearning-wisc/lens
中文摘要:LENS框架通过变分自编码器在大语言模型的潜在嵌入空间中直接合成偏好数据,相比基于文本的方法,不仅生成速度提升18倍、模型规模缩小16000倍,还在奖励模型泛化性能上取得显著优势,为高效数据增强提供了创新解决方案。
English Summary: The LENS framework introduces a novel approach to efficiently synthesize preference data in the latent embedding space of large language models using a Variational Autoencoder, significantly outperforming text-based methods in speed and model size while improving reward model generalization.

Authors:Ioana Ciuclea, Giorgio Longari, Alice Barbara Tumpach
Title: Geometric Learning of Canonical Parameterizations of $2D$-curves
Abstract:
Most datasets encountered in computer vision and medical applications present symmetries that should be taken into account in classification tasks. A typical example is the symmetry by rotation and/or scaling in object detection. A common way to build neural networks that learn the symmetries is to use data augmentation. In order to avoid data augmentation and build more sustainable algorithms, we present an alternative method to mod out symmetries based on the notion of section of a principal fiber bundle. This framework allows the use of simple metrics on the space of objects in order to measure dissimilarities between orbits of objects under the symmetry group. Moreover, the section used can be optimized to maximize separation of classes. We illustrate this methodology on a dataset of contours of objects for the groups of translations, rotations, scalings and reparameterizations. In particular, we present a $2$-parameter family of canonical parameterizations of curves, containing the constant-speed parameterization as a special case, which we believe is interesting in its own right. We hope that this simple application will serve to convey the geometric concepts underlying this method, which have a wide range of possible applications. The code is available at the following link: $\href{https://github.com/GiLonga/Geometric-Learning}{https://github.com/GiLonga/Geometric-Learning}$. A tutorial notebook showcasing an application of the code to a specific dataset is available at the following link: $\href{https://github.com/ioanaciuclea/geometric-learning-notebook}{https://github.com/ioanaciuclea/geometric-learning-notebook}$
中文: 本文提出了一种基于主纤维丛截面的几何框架,用于消除数据集中的对称性,为分类任务提供了一种比数据增强更可持续的方法,并在物体轮廓数据集上展示了其应用及一种新的参数化技术。
English: This paper introduces a geometric framework using principal fiber bundle sections to eliminate symmetries in datasets, offering a sustainable alternative to data augmentation for classification tasks and demonstrating its application on object contours with a novel parameterization method.

Authors:Yanbo Wang, Zixiang Xu, Yue Huang, Xiangqi Wang, Zirui Song, Lang Gao, Chenxi Wang, Xiangru Tang, Yue Zhao, Arman Cohan, Xiangliang Zhang, Xiuying Chen
Title: DyFlow: Dynamic Workflow Framework for Agentic Reasoning
Abstract:
Agent systems based on large language models (LLMs) have shown great potential in complex reasoning tasks, but building efficient and generalizable workflows remains a major challenge. Most existing approaches rely on manually designed processes, which limits their adaptability across different tasks. While a few methods attempt automated workflow generation, they are often tied to specific datasets or query types and make limited use of intermediate feedback, reducing system robustness and reasoning depth. Moreover, their operations are typically predefined and inflexible. To address these limitations, we propose DyFlow, a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures based on task requirements and real-time intermediate feedback, thereby enhancing cross-task generalization. DyFlow consists of two core components: a designer and an executor. The designer decomposes complex problems into a sequence of sub-goals defined by high-level objectives and dynamically plans the next steps based on intermediate outputs and feedback. These plans are then carried out by the executor, which executes each operation using dynamic operators with context-aware parameterization, enabling flexible and semantically grounded reasoning. We systematically evaluate DyFlow across diverse domains, including social reasoning, biomedical tasks, mathematical problem solving, and code generation. Results demonstrate that DyFlow significantly outperforms existing baselines, achieving substantial Pass@k improvements and exhibiting robust generalization across diverse domains. The code is publicly available at https://github.com/wyf23187/DyFlow.
中文摘要:DyFlow是一种动态工作流生成框架,通过实时反馈自适应构建和调整推理流程,在社交推理、生物医学任务和代码生成等多个领域显著优于现有方法。
English Summary: DyFlow is a dynamic workflow generation framework that adaptively constructs and adjusts reasoning procedures using real-time feedback, significantly outperforming existing methods across diverse domains including social reasoning, biomedical tasks, and code generation.

Authors:Yang Zhou, Kunhao Yuan, Ye Wei, Jishizhan Chen
Title: Multi-modal Liver Segmentation and Fibrosis Staging Using Real-world MRI Images
Abstract:
Liver fibrosis represents the accumulation of excessive extracellular matrix caused by sustained hepatic injury. It disrupts normal lobular architecture and function, increasing the chances of cirrhosis and liver failure. Precise staging of fibrosis for early diagnosis and intervention is often invasive, which carries risks and complications. To address this challenge, recent advances in artificial intelligence-based liver segmentation and fibrosis staging offer a non-invasive alternative. As a result, the CARE 2025 Challenge aimed for automated methods to quantify and analyse liver fibrosis in real-world scenarios, using multi-centre, multi-modal, and multi-phase MRI data. This challenge included tasks of precise liver segmentation (LiSeg) and fibrosis staging (LiFS). In this study, we developed an automated pipeline for both tasks across all the provided MRI modalities. This pipeline integrates pseudo-labelling based on multi-modal co-registration, liver segmentation using deep neural networks, and liver fibrosis staging based on shape, textural, appearance, and directional (STAD) features derived from segmentation masks and MRI images. By solely using the released data with limited annotations, our proposed pipeline demonstrated excellent generalisability for all MRI modalities, achieving top-tier performance across all competition subtasks. This approach provides a rapid and reproducible framework for quantitative MRI-based liver fibrosis assessment, supporting early diagnosis and clinical decision-making. Code is available at https://github.com/YangForever/care2025_liver_biodreamer.
中文摘要:近期人工智能技术通过多模态MRI实现肝脏自动分割和纤维化分期,提供了一种快速、可重复的无创评估框架,有助于早期诊断和临床决策支持。
English Summary: Recent AI advancements enable non-invasive liver fibrosis assessment through automated segmentation and staging using multi-modal MRI, offering a rapid, reproducible framework for early diagnosis and clinical decision-making.

Authors:Gagandeep Singh, Samudi Amarsinghe, Priyanka Singh, Xue Li
Title: DGM4+: Dataset Extension for Global Scene Inconsistency
Abstract:
The rapid advances in generative models have significantly lowered the barrier to producing convincing multimodal disinformation. Fabricated images and manipulated captions increasingly co-occur to create persuasive false narratives. While the Detecting and Grounding Multi-Modal Media Manipulation (DGM4) dataset established a foundation for research in this area, it is restricted to local manipulations such as face swaps, attribute edits, and caption changes. This leaves a critical gap: global inconsistencies, such as mismatched foregrounds and backgrounds, which are now prevalent in real-world forgeries. To address this, we extend DGM4 with 5,000 high-quality samples that introduce Foreground-Background (FG-BG) mismatches and their hybrids with text manipulations. Using OpenAI's gpt-image-1 and carefully designed prompts, we generate human-centric news-style images where authentic figures are placed into absurd or impossible backdrops (e.g., a teacher calmly addressing students on the surface of Mars). Captions are produced under three conditions: literal, text attribute, and text split, yielding three new manipulation categories: FG-BG, FG-BG+TA, and FG-BG+TS. Quality control pipelines enforce one-to-three visible faces, perceptual hash deduplication, OCR-based text scrubbing, and realistic headline length. By introducing global manipulations, our extension complements existing datasets, creating a benchmark DGM4+ that tests detectors on both local and global reasoning. This resource is intended to strengthen evaluation of multimodal models such as HAMMER, which currently struggle with FG-BG inconsistencies. We release our DGM4+ dataset and generation script at https://github.com/Gaganx0/DGM4plus
中文: 本研究扩展了DGM4数据集,新增5000个包含前景-背景错位及混合文本篡改的样本,以解决多模态虚假信息中的全局不一致性问题,建立了DGM4+基准来提升检测模型的评估效果。
English: This study extends the DGM4 dataset with 5,000 samples featuring foreground-background mismatches and hybrid text manipulations to address global inconsistencies in multimodal disinformation, creating the DGM4+ benchmark for improved detection model evaluation.

Authors:Gagandeep Singh, Samudi Amarsinghe, Urawee Thani, Ki Fung Wong, Priyanka Singh, Xue Li
Title: SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies
Abstract:
We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
中文摘要:本研究通过引入分割引导评分(SGS)方法增强HAMMER模型,有效检测图像篡改中的全局场景不一致性,无需重新训练即可提升检测精度,同时保持计算效率。
English Summary: This study enhances the HAMMER model by introducing a segmentation-guided scoring (SGS) method to detect global scene inconsistencies in manipulated images, improving detection accuracy without retraining while maintaining computational efficiency.

Authors:Christoph Timmermann, Hyunse Lee, Woojin Lee
Title: SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
Abstract:
While Contrastive Language-Image Pretraining (CLIP) excels at zero-shot tasks by aligning image and text embeddings, its performance in few-shot classification is hindered by a critical limitation: intra-modal misalignment. This issue, caused by a persistent modality gap and CLIP's exclusively inter-modal training objective, leaves the embedding spaces uncalibrated, making direct image-to-image comparisons unreliable. Existing methods attempt to address this by refining similarity logits or by computationally expensive per-sample optimization. To overcome these challenges, we introduce SeMoBridge, a lightweight yet powerful approach that directly addresses the misalignment. Our method maps images into the text modality, while keeping their semantic content intact through what we call a Semantic Modality Bridge. SeMoBridge is closed-form and can optionally be trained through multi-modal supervision, combining image and text-alignment losses to optimize the projection. Experiments show that the trained version, SeMoBridge-T, requires only a fraction of the training time while overall outperforming other methods, particularly in low-data scenarios (1, 2, and 4 shots). The code is available at https://github.com/christti98/semobridge.
中文: SeMoBridge 是一种轻量级方法,通过将图像映射到文本模态并保持语义完整性来解决 CLIP 的模态内错位问题,在少量样本场景中以极短训练时间实现卓越性能。
English: SeMoBridge is a lightweight method that addresses CLIP's intra-modal misalignment by mapping images into the text modality while preserving semantics, achieving superior few-shot performance with minimal training time.

Authors:Daphne Theodorakopoulos, Elisabeth Eberling, Miriam Bodenheimer, Sabine Loos, Frederic Stahl
Title: FITS: Towards an AI-Driven Fashion Information Tool for Sustainability
Abstract:
Access to credible sustainability information in the fashion industry remains limited and challenging to interpret, despite growing public and regulatory demands for transparency. General-purpose language models often lack domain-specific knowledge and tend to "hallucinate", which is particularly harmful for fields where factual correctness is crucial. This work explores how Natural Language Processing (NLP) techniques can be applied to classify sustainability data for fashion brands, thereby addressing the scarcity of credible and accessible information in this domain. We present a prototype Fashion Information Tool for Sustainability (FITS), a transformer-based system that extracts and classifies sustainability information from credible, unstructured text sources: NGO reports and scientific publications. Several BERT-based language models, including models pretrained on scientific and climate-specific data, are fine-tuned on our curated corpus using a domain-specific classification schema, with hyperparameters optimized via Bayesian optimization. FITS allows users to search for relevant data, analyze their own data, and explore the information via an interactive interface. We evaluated FITS in two focus groups of potential users concerning usability, visual design, content clarity, possible use cases, and desired features. Our results highlight the value of domain-adapted NLP in promoting informed decision-making and emphasize the broader potential of AI applications in addressing climate-related challenges. Finally, this work provides a valuable dataset, the SustainableTextileCorpus, along with a methodology for future updates. Code available at https://github.com/daphne12345/FITS
中文: 本研究开发了基于Transformer的FITS工具,通过自然语言处理技术对时尚行业可持续性信息进行分类,以解决可信数据匮乏的问题,并验证了领域专用模型在提升决策准确性方面的重要价值。
English: This study introduces FITS, a transformer-based NLP tool that classifies sustainability information from credible sources to address the lack of accessible data in the fashion industry, demonstrating the value of domain-specific models for accurate decision-making.

Authors:Lubian Bai, Xiuyuan Zhang, Siqi Zhang, Zepeng Zhang, Haoyu Wang, Wei Qin, Shihong Du
Title: GeoLink: Empowering Remote Sensing Foundation Model with OpenStreetMap Data
Abstract:
Integrating ground-level geospatial data with rich geographic context, like OpenStreetMap (OSM), into remote sensing (RS) foundation models (FMs) is essential for advancing geospatial intelligence and supporting a broad spectrum of tasks. However, modality gap between RS and OSM data, including differences in data structure, content, and spatial granularity, makes effective synergy highly challenging, and most existing RS FMs focus on imagery alone. To this end, this study presents GeoLink, a multimodal framework that leverages OSM data to enhance RS FM during both the pretraining and downstream task stages. Specifically, GeoLink enhances RS self-supervised pretraining using multi-granularity learning signals derived from OSM data, guided by cross-modal spatial correlations for information interaction and collaboration. It also introduces image mask-reconstruction to enable sparse input for efficient pretraining. For downstream tasks, GeoLink generates both unimodal and multimodal fine-grained encodings to support a wide range of applications, from common RS interpretation tasks like land cover classification to more comprehensive geographic tasks like urban function zone mapping. Extensive experiments show that incorporating OSM data during pretraining enhances the performance of the RS image encoder, while fusing RS and OSM data in downstream tasks improves the FM's adaptability to complex geographic scenarios. These results underscore the potential of multimodal synergy in advancing high-level geospatial artificial intelligence. Moreover, we find that spatial correlation plays a crucial role in enabling effective multimodal geospatial data integration. Code, checkpoints, and using examples are released at https://github.com/bailubin/GeoLink_NeurIPS2025
中文: 本研究提出GeoLink多模态框架,通过在预训练和下游任务中融合OpenStreetMap数据与遥感基础模型,有效弥合模态差异并提升地理空间智能任务的性能表现。
English: This study introduces GeoLink, a multimodal framework that integrates OpenStreetMap data with remote sensing foundation models during pretraining and downstream tasks to bridge modality gaps and enhance performance in geospatial intelligence applications.

Authors:Subramanya Nagabhushanaradhya
Title: OpenID Connect for Agents (OIDC-A) 1.0: A Standard Extension for LLM-Based Agent Identity and Authorization
Abstract:
OpenID Connect for Agents (OIDC-A) 1.0 is an extension to OpenID Connect Core 1.0 that provides a comprehensive framework for representing, authenticating, and authorizing LLM-based agents within the OAuth 2.0 ecosystem. As autonomous AI agents become increasingly prevalent in digital systems, there is a critical need for standardized protocols to establish agent identity, verify agent attestation, represent delegation chains, and enable fine-grained authorization based on agent attributes. This specification defines standard claims, endpoints, and protocols that address these requirements while maintaining compatibility with existing OAuth 2.0 and OpenID Connect infrastructure. The proposed framework introduces mechanisms for agent identity representation, delegation chain validation, attestation verification, and capability-based authorization, providing a foundation for secure and trustworthy agent-to-service interactions in modern distributed systems.
中文: OIDC-A 1.0 扩展了OpenID Connect协议,为基于大语言模型的智能代理在OAuth 2.0生态系统中提供完整的身份验证与授权框架,通过代理身份表示、委托链验证和能力授权等机制确保分布式系统中的安全交互。
English: OIDC-A 1.0 extends OpenID Connect to establish a standardized framework for authenticating and authorizing AI agents within OAuth 2.0 systems, introducing mechanisms for identity representation, delegation validation, and capability-based authorization.

Authors:Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester
Title: A Multi-purpose Tracking Framework for Salmon Welfare Monitoring in Challenging Environments
Abstract:
Computer Vision (CV)-based continuous, automated and precise salmon welfare monitoring is a key step toward reduced salmon mortality and improved salmon welfare in industrial aquaculture net pens. Available CV methods for determining welfare indicators focus on single indicators and rely on object detectors and trackers from other application areas to aid their welfare indicator calculation algorithm. This comes with a high resource demand for real-world applications, since each indicator must be calculated separately. In addition, the methods are vulnerable to difficulties in underwater salmon scenes, such as object occlusion, similar object appearance, and similar object motion. To address these challenges, we propose a flexible tracking framework that uses a pose estimation network to extract bounding boxes around salmon and their corresponding body parts, and exploits information about the body parts, through specialized modules, to tackle challenges specific to underwater salmon scenes. Subsequently, the high-detail body part tracks are employed to calculate welfare indicators. We construct two novel datasets assessing two salmon tracking challenges: salmon ID transfers in crowded scenes and salmon ID switches during turning. Our method outperforms the current state-of-the-art pedestrian tracker, BoostTrack, for both salmon tracking challenges. Additionally, we create a dataset for calculating salmon tail beat wavelength, demonstrating that our body part tracking method is well-suited for automated welfare monitoring based on tail beat analysis. Datasets and code are available at https://github.com/espenbh/BoostCompTrack.
中文: 一种基于姿态估计和专用模块的新型计算机视觉框架,能够在水下环境中有效追踪鲑鱼及其身体部位,在拥挤和转向场景中优于现有方法,并通过尾鳍摆动分析实现自动化福利监测。
English: A novel computer vision framework using pose estimation and specialized modules effectively tracks salmon and their body parts in underwater environments, outperforming existing methods in crowded and turning scenarios while enabling automated welfare monitoring through tail beat analysis.

Authors:Yuan Zhao, Youwei Pang, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu, Xiaoqi Zhao
Title: UniMMAD: Unified Multi-Modal and Multi-Class Anomaly Detection via MoE-Driven Feature Decompression
Abstract:
Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a ``general to specific'' paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75\% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
Chinese: UniMMAD提出了一种统一的多模态多类别异常检测框架,通过基于专家混合的特征解压缩机制实现自适应解耦重建,在多个数据集上达到最优性能,同时将参数减少75%。
English: UniMMAD introduces a unified framework for multi-modal and multi-class anomaly detection, utilizing a Mixture-of-Experts-driven feature decompression mechanism to enable adaptive, disentangled reconstruction and achieve state-of-the-art performance across diverse datasets while reducing parameters by 75%.

Authors:Zhicheng Zhou, Jing Li, Suming Qiu, Junjie Huang, Linyuan Qiu, Zhijie Sun
Title: DeepJSONEval: Benchmarking Complex Nested JSON Data Mining for Large Language Models
Abstract:
The internet is saturated with low-density, high-redundancy information, such as social media comments, repetitive news, and lengthy discussions, making it difficult to extract valuable insights efficiently. Multi-layer nested JSON structures provide an effective solution by compressing such information into semantically rich, hierarchical representations, which organize data into key-value pairs, arrays, and nested objects, preserving contextual relationships and enabling efficient storage, retrieval, and semantic querying. For instance, in news aggregation, a JSON object can nest an article's metadata (title, author, date), content (text, multimedia), and multimedia information (multimedia type, caption) hierarchically. Large Language Models (LLMs) play a transformative role in web data mining by parsing unstructured text and outputting structured results directly into complex JSON schemas. However, current benchmarks for evaluating LLMs' JSON output capabilities overemphasize pure JSON generation rather than assessing data comprehension and extraction abilities, a limitation that lacks relevance to practical web data mining tasks. To address this, we introduce DeepJSONEval, a novel benchmark featuring 2100 multi-domain instances with deep nested structures, categorized by difficulty. Experiments show significant performance gaps among LLMs in handling such complexity. Our benchmark and datasets are open-sourced to advance research in structured JSON generation.(https://github.com/GTS-AI-Infra-Lab-SotaS/DeepJSONEval).
中文摘要:互联网信息过载问题可通过多层嵌套JSON结构实现高效分层压缩,而现有大语言模型基准过于侧重格式生成却忽略实际数据提取能力,为此推出DeepJSONEval基准以评估复杂JSON处理性能。
English Summary: The internet's information overload is effectively managed by using multi-layer nested JSON structures for hierarchical data compression, while current LLM benchmarks inadequately assess practical data extraction skills, prompting the introduction of the DeepJSONEval benchmark to evaluate complex JSON generation.

Authors:Yindong Wang, Martin Preiß, Margarita Bugueño, Jan Vincent Hoffbauer, Abdullatif Ghajar, Tolga Buz, Gerard de Melo
Title: ReFACT: A Benchmark for Scientific Confabulation Detection with Positional Error Annotations
Abstract:
Large Language Models (LLMs) frequently confabulate scientific facts, severely undermining their trustworthiness. Addressing this challenge requires benchmarks that go beyond binary factuality and enable fine-grained evaluation. We introduce ReFACT (Reddit False And Correct Texts), a benchmark of 1,001 expert-annotated question-answer pairs spanning diverse scientific domains for the detection of scientific confabulation. Each instance includes both a scientifically correct answer and a non-factual counterpart annotated with precise error spans and error types. ReFACT enables multi-stage evaluation: (1) confabulation detection, (2) fine-grained error localization, and (3) correction. We benchmark 9 state-of-the-art LLMs, revealing limited performance (about 50 percent accuracy). Even top models such as GPT-4o fail to distinguish factual from confabulated scientific answers, raising concerns about the reliability of LLM-as-judge evaluation paradigms. Our findings highlight the need for fine-grained, human-validated benchmarks to detect and correct scientific confabulation in domain-specific contexts. The dataset is available at: https://github.com/ddz5431/ReFACT
大语言模型常捏造科学事实,为此开发了ReFACT基准,用于在多个科学领域中对这些错误进行精细检测和修正。
Large language models often fabricate scientific information, prompting the creation of the ReFACT benchmark for detailed detection and correction of these errors across multiple scientific fields.

Authors:Olga Krestinskaya, Mohammed E. Fouda, Ahmed Eltawil, Khaled N. Salama
Title: CIMNAS: A Joint Framework for Compute-In-Memory-Aware Neural Architecture Search
Abstract:
To maximize hardware efficiency and performance accuracy in Compute-In-Memory (CIM)-based neural network accelerators for Artificial Intelligence (AI) applications, co-optimizing both software and hardware design parameters is essential. Manual tuning is impractical due to the vast number of parameters and their complex interdependencies. To effectively automate the design and optimization of CIM-based neural network accelerators, hardware-aware neural architecture search (HW-NAS) techniques can be applied. This work introduces CIMNAS, a joint model-quantization-hardware optimization framework for CIM architectures. CIMNAS simultaneously searches across software parameters, quantization policies, and a broad range of hardware parameters, incorporating device-, circuit-, and architecture-level co-optimizations. CIMNAS experiments were conducted over a search space of 9.9x10^85 potential parameter combinations with the MobileNet model as a baseline and RRAM-based CIM architecture. Evaluated on the ImageNet dataset, CIMNAS achieved a reduction in energy-delay-area product (EDAP) ranging from 90.1x to 104.5x, an improvement in TOPS/W between 4.68x and 4.82x, and an enhancement in TOPS/mm^2 from 11.3x to 12.78x relative to various baselines, all while maintaining an accuracy of 73.81%. The adaptability and robustness of CIMNAS are demonstrated by extending the framework to support the SRAM-based ResNet50 architecture, achieving up to an 819.5x reduction in EDAP. Unlike other state-of-the-art methods, CIMNAS achieves EDAP-focused optimization without any accuracy loss, generating diverse software-hardware parameter combinations for high-performance CIM-based neural network designs. The source code of CIMNAS is available at https://github.com/OlgaKrestinskaya/CIMNAS.
中文:CIMNAS是一个联合优化软件、量化和硬件参数的计算内存神经网络加速器框架,能在保持精度的同时显著提升能效。
English: CIMNAS is a comprehensive framework that co-optimizes software, quantization, and hardware parameters for compute-in-memory neural network accelerators, achieving significant efficiency gains without accuracy loss.

Authors:Boyoung Kim, Dosung Lee, Sumin An, Jinseong Jeong, Paul Hongsuck Seo
Title: ReTAG: Retrieval-Enhanced, Topic-Augmented Graph-Based Global Sensemaking
Abstract:
Recent advances in question answering have led to substantial progress in tasks such as multi-hop reasoning. However, global sensemaking-answering questions by synthesizing information from an entire corpus remains a significant challenge. A prior graph-based approach to global sensemaking lacks retrieval mechanisms, topic specificity, and incurs high inference costs. To address these limitations, we propose ReTAG, a Retrieval-Enhanced, Topic-Augmented Graph framework that constructs topic-specific subgraphs and retrieves the relevant summaries for response generation. Experiments show that ReTAG improves response quality while significantly reducing inference time compared to the baseline. Our code is available at https://github.com/bykimby/retag.
中文: ReTAG是一种新颖的框架,通过构建主题特定子图并检索相关摘要来增强问答中的全局理解,在提高回答质量的同时显著减少了推理时间。
English: ReTAG is a novel framework that enhances global sensemaking in question answering by constructing topic-specific subgraphs and retrieving relevant summaries, improving response quality while reducing inference time.

Authors:Yuan Gao, Sangwook Kim, Chris McIntosh
Title: EchoingECG: An Electrocardiogram Cross-Modal Model for Echocardiogram Tasks
Abstract:
Electrocardiogram (ECG) is a widely used tool for assessing cardiac function due to its low cost and accessibility. Emergent research shows that ECGs can help make predictions on key outcomes traditionally derived from more complex modalities such as echocardiograms (ECHO), enabling the use of ECGs as a more accessible method to predict broader measurements of cardiac function. ECHO, in particular, are of great importance because they require considerable hospital resources while playing a key role in clinical cardiac assessment. To aid this use case, we introduce EchoingECG, a probabilistic student-teacher model that leverages uncertainty-aware ECG embeddings and ECHO supervision to improve ECG-based cardiac function prediction. Our approach integrates Probabilistic Cross-Modal Embeddings (PCME++), a probabilistic contrastive framework, with ECHO-CLIP, a vision-language pre-trained model trained on ECHO-text pairs, to distill ECHO knowledge into ECG representations. Through experiments and external validation, we showed that EchoingECG outperforms state-of-the-art foundation ECG models in zero-shot, few-shot, and fine-tune settings for ECHO predictions based on ECG. We also highlighted that variance estimation (enabled through our method) enhanced our understanding of model performance by identifying underlying regions of uncertainty within ECGs. The code is available: https://github.com/mcintoshML/EchoingECG.
中文摘要:本研究提出的EchoingECG概率师生模型,通过不确定性感知的心电图嵌入和超声心动图监督,将超声知识蒸馏到心电图表征中,显著提升了基于心电图的心脏功能预测性能,在多场景下超越现有先进模型。
English Summary: The study introduces EchoingECG, a probabilistic student-teacher model that enhances ECG-based cardiac function prediction by distilling knowledge from echocardiograms through uncertainty-aware embeddings and cross-modal integration, outperforming existing methods across various settings.

Authors:Amber Srivastava, Salar Basiri, Srinivasa Salapaka
Title: Autonomy-Aware Clustering: When Local Decisions Supersede Global Prescriptions
Abstract:
Clustering arises in a wide range of problem formulations, yet most existing approaches assume that the entities under clustering are passive and strictly conform to their assigned groups. In reality, entities often exhibit local autonomy, overriding prescribed associations in ways not fully captured by feature representations. Such autonomy can substantially reshape clustering outcomes -- altering cluster compositions, geometry, and cardinality -- with significant downstream effects on inference and decision-making. We introduce autonomy-aware clustering, a reinforcement learning (RL) framework that learns and accounts for the influence of local autonomy without requiring prior knowledge of its form. Our approach integrates RL with a Deterministic Annealing (DA) procedure, where, to determine underlying clusters, DA naturally promotes exploration in early stages of annealing and transitions to exploitation later. We also show that the annealing procedure exhibits phase transitions that enable design of efficient annealing schedules. To further enhance adaptability, we propose the Adaptive Distance Estimation Network (ADEN), a transformer-based attention model that learns dependencies between entities and cluster representatives within the RL loop, accommodates variable-sized inputs and outputs, and enables knowledge transfer across diverse problem instances. Empirical results show that our framework closely aligns with underlying data dynamics: even without explicit autonomy models, it achieves solutions close to the ground truth (gap ~3-4%), whereas ignoring autonomy leads to substantially larger gaps (~35-40%). The code and data are publicly available at https://github.com/salar96/AutonomyAwareClustering.
中文摘要:本文提出了一种自主感知聚类框架,结合强化学习和确定性退火算法来考虑实体的局部自主性,无需先验自主模型即可获得接近真实情况的聚类结果。
English summary: This paper introduces an autonomy-aware clustering framework using reinforcement learning and deterministic annealing to account for entities' local autonomy, achieving near-ground-truth results without prior autonomy models.

Authors:Jia Jun Cheng Xian, Muchen Li, Haotian Yang, Xin Tao, Pengfei Wan, Leonid Sigal, Renjie Liao
Title: Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs
Abstract:
Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our Open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.
中文: 本文提出文本偏好优化(TPO)框架,通过训练模型区分匹配与不匹配提示词来实现文本-图像模型的免标注对齐,在多个基准测试中均显著提升人类偏好分数与图文对齐效果。
English: This paper introduces Text Preference Optimization (TPO), a novel framework that enhances text-to-image model alignment by training models to prefer matched over mismatched prompts, eliminating the need for costly human-annotated image preference data while outperforming existing methods in human preference scores and alignment accuracy.

Authors:Xinyu Pu, Hongsong Wang, Jie Gui, Pan Zhou
Title: Dragging with Geometry: From Pixels to Geometry-Guided Image Editing
Abstract:
Interactive point-based image editing serves as a controllable editor, enabling precise and flexible manipulation of image content. However, most drag-based methods operate primarily on the 2D pixel plane with limited use of 3D cues. As a result, they often produce imprecise and inconsistent edits, particularly in geometry-intensive scenarios such as rotations and perspective transformations. To address these limitations, we propose a novel geometry-guided drag-based image editing method - GeoDrag, which addresses three key challenges: 1) incorporating 3D geometric cues into pixel-level editing, 2) mitigating discontinuities caused by geometry-only guidance, and 3) resolving conflicts arising from multi-point dragging. Built upon a unified displacement field that jointly encodes 3D geometry and 2D spatial priors, GeoDrag enables coherent, high-fidelity, and structure-consistent editing in a single forward pass. In addition, a conflict-free partitioning strategy is introduced to isolate editing regions, effectively preventing interference and ensuring consistency. Extensive experiments across various editing scenarios validate the effectiveness of our method, showing superior precision, structural consistency, and reliable multi-point editability. The code will be available on https://github.com/xinyu-pu/GeoDrag .
中文: GeoDrag提出了一种几何引导的拖拽式图像编辑方法,通过融合三维几何线索与二维空间先验的统一位移场,在单次前向传播中实现精确、连贯且结构一致的编辑,并采用无冲突分区策略有效解决多点拖拽的干扰问题。
English: GeoDrag introduces a geometry-guided drag-based image editing method that integrates 3D geometric cues with 2D spatial priors through a unified displacement field, enabling precise, coherent, and structure-consistent edits in a single forward pass while resolving multi-point conflicts via region partitioning.

Authors:Huikang Su, Dengyun Peng, Zifeng Zhuang, YuHan Liu, Qiguang Chen, Donglin Wang, Qinghe Liu
Title: Boundary-to-Region Supervision for Offline Safe Reinforcement Learning
Abstract:
Offline safe reinforcement learning aims to learn policies that satisfy predefined safety constraints from static datasets. Existing sequence-model-based methods condition action generation on symmetric input tokens for return-to-go and cost-to-go, neglecting their intrinsic asymmetry: return-to-go (RTG) serves as a flexible performance target, while cost-to-go (CTG) should represent a rigid safety boundary. This symmetric conditioning leads to unreliable constraint satisfaction, especially when encountering out-of-distribution cost trajectories. To address this, we propose Boundary-to-Region (B2R), a framework that enables asymmetric conditioning through cost signal realignment . B2R redefines CTG as a boundary constraint under a fixed safety budget, unifying the cost distribution of all feasible trajectories while preserving reward structures. Combined with rotary positional embeddings , it enhances exploration within the safe region. Experimental results show that B2R satisfies safety constraints in 35 out of 38 safety-critical tasks while achieving superior reward performance over baseline methods. This work highlights the limitations of symmetric token conditioning and establishes a new theoretical and practical approach for applying sequence models to safe RL. Our code is available at https://github.com/HuikangSu/B2R.
中文摘要:本研究提出的边界到区域(B2R)框架通过成本信号重对齐实现非对称条件处理,有效解决了序列模型在离线安全强化学习中对成本约束的对称处理缺陷,在38项安全关键任务中实现35项的安全约束满足,同时获得优于基线方法的奖励表现。
English Summary: The proposed Boundary-to-Region (B2R) framework addresses limitations in offline safe reinforcement learning by introducing asymmetric conditioning of cost-to-go signals, enabling reliable safety constraint satisfaction while maintaining high reward performance across diverse tasks.

Authors:Shunpeng Chen, Changwei Wang, Rongtao Xu, Xingtian Pei, Yukun Song, Jinzhou Lin, Wenhao Xu, Jingyi Zhang, Li Guo, Shibiao Xu
Title: SAGE: Spatial-visual Adaptive Graph Exploration for Visual Place Recognition
Abstract:
Visual Place Recognition (VPR) requires robust retrieval of geotagged images despite large appearance, viewpoint, and environmental variation. Prior methods focus on descriptor fine-tuning or fixed sampling strategies yet neglect the dynamic interplay between spatial context and visual similarity during training. We present SAGE (Spatial-visual Adaptive Graph Exploration), a unified training pipeline that enhances granular spatial-visual discrimination by jointly improving local feature aggregation, organize samples during training, and hard sample mining. We introduce a lightweight Soft Probing module that learns residual weights from training data for patch descriptors before bilinear aggregation, boosting distinctive local cues. During training we reconstruct an online geo-visual graph that fuses geographic proximity and current visual similarity so that candidate neighborhoods reflect the evolving embedding landscape. To concentrate learning on the most informative place neighborhoods, we seed clusters from high-affinity anchors and iteratively expand them with a greedy weighted clique expansion sampler. Implemented with a frozen DINOv2 backbone and parameter-efficient fine-tuning, SAGE achieves SOTA across eight benchmarks. It attains 98.9%, 95.8%, 94.5%, and 96.0% Recall@1 on SPED, Pitts30k-test, MSLS-val, and Nordland, respectively. Notably, our method obtains 100% Recall@10 on SPED only using 4096D global descriptors. Code and model will be available at: https://github.com/chenshunpeng/SAGE.
中文摘要:SAGE提出了一种统一的训练流程,通过自适应图探索和困难样本挖掘动态整合空间上下文与视觉相似性,从而在多个基准测试中实现了高召回率的最先进视觉地点识别性能。
English Summary: SAGE introduces a unified training pipeline that enhances visual place recognition by dynamically integrating spatial context and visual similarity through adaptive graph exploration and hard sample mining, achieving state-of-the-art results across multiple benchmarks with high recall rates.

Authors:Dengming Zhang, Xiaowen Ma, Zhenliang Ni, Zhenkai Wu, Han Shu, Xin Jiang, Xinghao Chen
Title: Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking
Abstract:
Model merging, which combines multiple domain-specialized experts into a single model, offers a practical path to endow Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with broad capabilities without the cost of joint training or serving many models. However, training-free methods rely on hand-tuned coefficients, whereas training-based methods primarily align parameters rather than downstream task behavior and typically treat all layers uniformly, ignoring inter-layer heterogeneity. We introduce Expert Merging, a training-light method that learns a small set of layer-wise coefficients using only unlabeled calibration data. The coefficients are optimized to explicitly align the merged model's hidden states and logits with those of the corresponding experts, with a coefficient regularizer for stability and task-weighted losses for controllable trade-offs. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking: a normalized layer-importance metric, derived from learned coefficients, task-vector magnitudes, and parameter counts, allocates more chunk-wise coefficients to high-importance layers while keeping low-importance layers lightweight. The result is a label-free, parameter-efficient, and scalable approach to multi-expert model merging across LLMs and MLLMs. Across MLLM backbones (InternVL and Qwen2-VL) and the LLM backbone (Mistral), our method surpasses strong training-free and training-based merging baselines, with Expert Merging++ delivering further gains and, in some cases, even exceeding supervised Mixture Training. The source code is available at https://github.com/Littleor/ExpertMerging.
中文: Expert Merging是一种轻量训练方法,利用无标注数据优化分层系数,对齐隐藏状态和逻辑值以实现高效的多专家模型融合,而Expert Merging++通过重要性引导的分块策略进一步提升性能,在LLM和MLLM上表现优异。
English: Expert Merging is a training-light method that uses unlabeled data to optimize layer-wise coefficients, aligning hidden states and logits for efficient multi-expert model merging, with Expert Merging++ enhancing it through importance-guided chunking to improve performance across LLMs and MLLMs.

Authors:Tingyu Shi, Fan Lyu, Shaoliang Peng
Title: Annotation-Efficient Active Test-Time Adaptation with Conformal Prediction
Abstract:
Active Test-Time Adaptation (ATTA) improves model robustness under domain shift by selectively querying human annotations at deployment, but existing methods use heuristic uncertainty measures and suffer from low data selection efficiency, wasting human annotation budget. We propose Conformal Prediction Active TTA (CPATTA), which first brings principled, coverage-guaranteed uncertainty into ATTA. CPATTA employs smoothed conformal scores with a top-K certainty measure, an online weight-update algorithm driven by pseudo coverage, a domain-shift detector that adapts human supervision, and a staged update scheme balances human-labeled and model-labeled data. Extensive experiments demonstrate that CPATTA consistently outperforms the state-of-the-art ATTA methods by around 5% in accuracy. Our code and datasets are available at https://github.com/tingyushi/CPATTA.
中文摘要:CPATTA采用基于保形预测的理论框架改进主动测试时适应方法,通过优化的不确定性度量与自适应标注策略,在多项实验中比现有最优方法准确率提升约5%。
English Summary: CPATTA introduces a principled conformal prediction framework to enhance active test-time adaptation, achieving approximately 5% higher accuracy than existing methods through improved uncertainty measurement and adaptive human annotation strategies.

Authors:Kaiyu Li, Zixuan Jiang, Xiangyong Cao, Jiayu Wang, Yuchen Xiao, Deyu Meng, Zhi Wang
Title: DescribeEarth: Describe Anything for Remote Sensing Images
Abstract:
Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level, lacking object-level fine-grained interpretation, which prevents the full utilization and transformation of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset contains 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, a LLM-assisted question-answering based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code and weights are released at https://github.com/earth-insights/DescribeEarth.
中文摘要:本研究提出了Geo-DLC这一针对遥感图像的细粒度目标级描述新任务,构建了DE-Dataset数据集并建立了DE-Benchmark评估体系,所设计的DescribeEarth模型在准确性、丰富性和语法规范性上均优于现有先进多模态大语言模型。
English Summary: The study introduces Geo-DLC, a novel object-level fine-grained captioning task for remote sensing images, supported by the DE-Dataset and evaluated through the DE-Benchmark, with the proposed DescribeEarth model demonstrating superior performance in accuracy and detail over existing methods.

Authors:Gihan Panapitiya, Emily Saldanha, Heather Job, Olivia Hess
Title: AutoLabs: Cognitive Multi-Agent Systems with Self-Correction for Autonomous Chemical Experimentation
Abstract:
The automation of chemical research through self-driving laboratories (SDLs) promises to accelerate scientific discovery, yet the reliability and granular performance of the underlying AI agents remain critical, under-examined challenges. In this work, we introduce AutoLabs, a self-correcting, multi-agent architecture designed to autonomously translate natural-language instructions into executable protocols for a high-throughput liquid handler. The system engages users in dialogue, decomposes experimental goals into discrete tasks for specialized agents, performs tool-assisted stoichiometric calculations, and iteratively self-corrects its output before generating a hardware-ready file. We present a comprehensive evaluation framework featuring five benchmark experiments of increasing complexity, from simple sample preparation to multi-plate timed syntheses. Through a systematic ablation study of 20 agent configurations, we assess the impact of reasoning capacity, architectural design (single- vs. multi-agent), tool use, and self-correction mechanisms. Our results demonstrate that agent reasoning capacity is the most critical factor for success, reducing quantitative errors in chemical amounts (nRMSE) by over 85% in complex tasks. When combined with a multi-agent architecture and iterative self-correction, AutoLabs achieves near-expert procedural accuracy (F1-score > 0.89) on challenging multi-step syntheses. These findings establish a clear blueprint for developing robust and trustworthy AI partners for autonomous laboratories, highlighting the synergistic effects of modular design, advanced reasoning, and self-correction to ensure both performance and reliability in high-stakes scientific applications. Code: https://github.com/pnnl/autolabs
中文:AutoLabs提出了一种自我修正的多智能体系统,可将自然语言指令转化为可执行的实验流程,其高级推理和模块化设计使复杂任务的定量误差降低超85%,并在多步合成中实现接近专家水平的精确度。
English: AutoLabs introduces a self-correcting multi-agent system that translates natural language into executable lab protocols, with advanced reasoning and modular design reducing quantitative errors by over 85% and achieving near-expert accuracy in complex syntheses.

Authors:Shangqi Gao, Sihan Wang, Yibo Gao, Boming Wang, Xiahai Zhuang, Anne Warren, Grant Stewart, James Jones, Mireia Crispin-Ortuzar
Title: Evaluating Foundation Models with Pathological Concept Learning for Kidney Cancer
Abstract:
To evaluate the translational capabilities of foundation models, we develop a pathological concept learning approach focused on kidney cancer. By leveraging TNM staging guidelines and pathology reports, we build comprehensive pathological concepts for kidney cancer. Then, we extract deep features from whole slide images using foundation models, construct pathological graphs to capture spatial correlations, and trained graph neural networks to identify these concepts. Finally, we demonstrate the effectiveness of this approach in kidney cancer survival analysis, highlighting its explainability and fairness in identifying low- and high-risk patients. The source code has been released by https://github.com/shangqigao/RadioPath.
中文: 本研究通过结合TNM分期与基础模型分析全切片图像,开发了一种肾癌病理概念学习方法,在生存预测中展现出更好的可解释性与公平性。
English: This study develops a pathological concept learning method for kidney cancer by integrating TNM staging with foundation models to analyze whole slide images, demonstrating enhanced survival prediction with improved explainability and fairness.

Authors:Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao
Title: Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play
Abstract:
Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision-language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic framework enabling VLM self-improvement through competitive visual games generated from arbitrary image pairs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code has been released at https://github.com/wangqinsi1/Vision-Zero.
中文摘要:Vision-Zero 是一种领域无关的框架,通过任意图像对生成竞争性视觉游戏,使视觉语言模型能够实现自我优化,无需人工标注即可在多项推理任务中达到最先进性能。
English Summary: Vision-Zero is a domain-agnostic framework that enables vision-language models to self-improve through competitive visual games generated from arbitrary image pairs, eliminating the need for manual annotation while achieving state-of-the-art performance across multiple reasoning tasks.

Authors:Victor Wang, Elias Stengel-Eskin
Title: Calibrating Verbalized Confidence with Self-Generated Distractors
Abstract:
Calibrated confidence estimates are necessary for large language model (LLM) outputs to be trusted by human users. While LLMs can express their confidence in human-interpretable ways, verbalized LLM-generated confidence scores have empirically been found to be miscalibrated, reporting high confidence on instances with low accuracy and thereby harming trust and safety. We hypothesize that this overconfidence often stems from a given LLM's heightened suggestibility when faced with claims that it encodes little information about; we empirically validate this hypothesis, finding more suggestibility on lower-accuracy claims. Building on this finding, we introduce Distractor-Normalized Coherence (DINCO), which estimates and accounts for an LLM's suggestibility bias by having the model verbalize its confidence independently across several self-generated distractors (i.e. alternative claims), and normalizes by the total verbalized confidence. To further improve calibration, we leverage generator-validator disagreement, augmenting normalized validator confidence with a consistency-based estimate of generator confidence. Here, we frame the popular approach of self-consistency as leveraging coherence across sampled generations, and normalized verbalized confidence as leveraging coherence across validations on incompatible claims, allowing us to integrate these complementary dimensions of coherence into DINCO. Moreover, our analysis shows that DINCO provides less saturated -- and therefore more usable -- confidence estimates, and that further sampling alone cannot close the gap between DINCO and baselines, with DINCO at 10 inference calls outperforming self-consistency at 100.
中文: 校准的置信度估计对于大语言模型输出的可信度至关重要,而提出的DINCO方法通过标准化模型自生成干扰项的口头置信度,并利用生成器-验证器分歧来提高准确性和可用性,从而解决了误校准问题。
English: Calibrated confidence estimates are crucial for trustworthy LLM outputs, and the proposed DINCO method addresses miscalibration by normalizing verbalized confidence across self-generated distractors and leveraging generator-validator disagreement to improve accuracy and usability.

Authors:Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son, Vu, Jenia Jitsev
Title: MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Abstract:
We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
中文: MixtureVitae是一个开放获取的预训练语料库,采用风险缓和的来源策略和透明处理流程,在降低法律风险的同时实现了优越的模型性能,在多项基准测试中持续超越其他许可数据集。
English: MixtureVitae is an open-access pretraining corpus designed to minimize legal risks while delivering strong model performance through a risk-mitigated sourcing strategy and transparent curation pipeline, consistently outperforming other permissive datasets in benchmarks.

Authors:Alexander Kovrigin, Aleksandra Eliseeva, Konstantin Grotov, Egor Bogomolov, Yaroslav Zharov
Title: PIPer: On-Device Environment Setup via Online Reinforcement Learning
Abstract:
Environment setup-the process of configuring the system to work with a specific software project-represents a persistent challenge in Software Engineering (SE). Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. This also helps SE researchers to scale execution-based benchmarks. However, recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. To address this limitation, we tune a specialized model for environment setup. We combine supervised fine-tuning for generating correct Bash scripts and Reinforcement Learning with Verifiable Rewards (RLVR) to adapt it to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with larger models-Qwen3-32B and GPT-4o. The training code and model checkpoints are available online: https://github.com/JetBrains-Research/PIPer.
Chinese: 我们通过监督微调和强化学习专门优化的模型,使轻量级Qwen3-8B在EnvBench-Python基准测试中达到了与Qwen3-32B和GPT-4o等大型模型相当的环境配置性能。
English: Our specialized model, fine-tuned with supervised learning and reinforcement learning for automated environment setup, enables the compact Qwen3-8B to match the performance of larger models like Qwen3-32B and GPT-4o on EnvBench-Python.

Authors:Hanyuan Gao, Xiaoxuan Yang
Title: Norm-Q: Effective Compression Method for Hidden Markov Models in Neuro-Symbolic Applications
Abstract:
Hidden Markov models (HMM) are commonly used in generation tasks and have demonstrated strong capabilities in neuro-symbolic applications for the Markov property. These applications leverage the strengths of neural networks and symbolic reasoning to create robust and interpretable AI systems. However, they may inherit and amplify the shortcomings of both approaches. Both components require dense computation and data transfer, and their communication further hinders performance. This paper proposes Norm-Q, a normalized linear quantization approach for compressing probabilistic symbolic models, such as HMMs. We reduce the bit width of the data with minimal impact, thereby alleviating memory and bandwidth stress and enabling deployment on potential custom hardware. Our method introduces a normalized quantization-aware expectation maximization process for probabilistic model training. The experimental results show that Norm-Q achieves a higher compression rate with reasonable score loss compared to traditional quantization methods. In the case of the constrained generation task of large language models, we successfully quantize an HMM of 4096 hidden states to 8 bits without loss and, at most, 3 bits with acceptable loss. Notably, the Norm-Q method can achieve a compression rate of 99% for the weights of the HMM. The code is open source at https://github.com/superstarghy/Norm-Q.
中文摘要:本文提出Norm-Q归一化线性量化方法,通过降低概率符号模型(如隐马尔可夫模型)的数据位宽,在保证精度的同时实现了高达99%的权重压缩率,有效缓解了内存与带宽压力。
English Summary: This paper introduces Norm-Q, a normalized linear quantization method that effectively compresses probabilistic symbolic models like HMMs by reducing data bit width with minimal performance impact, achieving up to 99% weight compression while maintaining acceptable accuracy.

Authors:Zhibo Hou, Zhiyu An, Wan Du
Title: Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring
Abstract:
When there exists an unlearnable source of randomness (noisy-TV) in the environment, a naively intrinsic reward driven exploring agent gets stuck at that source of randomness and fails at exploration. Intrinsic reward based on uncertainty estimation or distribution similarity, while eventually escapes noisy-TVs as time unfolds, suffers from poor sample efficiency and high computational cost. Inspired by recent findings from neuroscience that humans monitor their improvements during exploration, we propose a novel method for intrinsically-motivated exploration, named Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvements instead of prediction error or novelty, effectively rewards the agent for observing learnable transitions rather than the unlearnable transitions. We introduce a dual-network design that uses an error model to predict the expected prediction error of the dynamics model in its previous iteration, and use the difference between the model errors of the current iteration and previous iteration to guide exploration. We theoretically show that the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and that the error model is necessary to achieve monotonicity correspondence with IG. We empirically compared LPM against state-of-the-art baselines in noisy environments based on MNIST, 3D maze with 160x120 RGB inputs, and Atari. Results show that LPM's intrinsic reward converges faster, explores more states in the maze experiment, and achieves higher extrinsic reward in Atari. This conceptually simple approach marks a shift-of-paradigm of noise-robust exploration. For code to reproduce our experiments, see https://github.com/Akuna23Matata/LPM_exploration
Chinese: 提出的学习进度监控(LPM)方法通过奖励模型改进而非预测误差,有效避免了不可学习噪声的干扰,在嘈杂环境中实现了更快的收敛速度和更优的性能表现。
English: The proposed Learning Progress Monitoring (LPM) method improves exploration efficiency by rewarding model improvements instead of prediction errors, effectively avoiding distractions from unlearnable noise while achieving faster convergence and better performance in noisy environments.

Authors:Ana Paula Gomes Ferreira, Aleksandar Anžel, Izabel Oliva Marcilio de Souza, Helen Hughes, Alex J Elliot, Jude Dzevela Kong, Madlen Schranz, Alexander Ullrich, Georges Hattab
Title: The Open Syndrome Definition
Abstract:
Case definitions are essential for effectively communicating public health threats. However, the absence of a standardized, machine-readable format poses significant challenges to interoperability, epidemiological research, the exchange of qualitative data, and the effective application of computational analysis methods, including artificial intelligence (AI). This complicates comparisons and collaborations across organizations and regions, limits data integration, and hinders technological innovation in public health. To address these issues, we propose the first open, machine-readable format for representing case and syndrome definitions. Additionally, we introduce the first comprehensive dataset of standardized case definitions and tools to convert existing human-readable definitions into machine-readable formats. We also provide an accessible online platform for browsing, analyzing, and contributing new definitions, available at https://opensyndrome.org. The Open Syndrome Definition format enables consistent, scalable use of case definitions across systems, unlocking AI's potential to strengthen public health preparedness and response. The source code for the format can be found at https://github.com/OpenSyndrome/schema under the MIT license.
中文摘要:Open Syndrome Definition格式作为首个机器可读的病例定义开放标准,通过在线平台和配套工具解决了公共卫生数据互操作性难题,为人工智能应用和跨系统协作提供了技术基础。
English Summary: The proposed Open Syndrome Definition format addresses interoperability challenges by introducing the first machine-readable standard for case definitions, enabling AI applications and data integration in public health through an accessible online platform.

Authors:Hao Ban, Kaiyi Ji
Title: Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs
Abstract:
Large language models are often adapted using parameter-efficient techniques such as Low-Rank Adaptation (LoRA), formulated as $y = W_0x + BAx$, where $W_0$ is the pre-trained parameters and $x$ is the input to the adapted layer. While multi-adapter extensions often employ multiple LoRAs, prior studies suggest that the inner $A$ matrices are highly similar during training and thus suitable for sharing. We revisit this phenomenon and find that this similarity is largely attributable to the identical initialization rather than shared knowledge, with $B$ playing a more critical role in knowledge encoding and transfer. Motivated by these insights, we propose \textbf{ALoRA}, an asymmetric multi-LoRA design with multiple $A$ matrices and a single shared $B$ in multi-task fine-tuning, and \textbf{Fed-ALoRA}, which shares $B$ across clients in federated fine-tuning under both homogeneous and heterogeneous settings, through a novel matrix decomposition strategy to accommodate heterogeneous ranks across clients. Experiments on commonsense reasoning, math reasoning, multi-task NLP dataset, and federated NLP dataset demonstrate that our methods achieve more balanced performance across tasks with comparable or superior average accuracy relative to existing multi-LoRA approaches. Codes are available at https://github.com/OptMN-Lab/ALoRA.
中文: 该研究重新审视了LoRA内部矩阵的相似性,提出了ALoRA和Fed-ALoRA两种非对称设计方法,通过共享B矩阵在多任务和联邦微调中实现了更均衡且优越的性能,相关代码已开源。
English: The study revisits the similarity in LoRA's inner matrices and proposes ALoRA and Fed-ALoRA, which use asymmetric designs with shared B matrices, achieving balanced and superior performance in multi-task and federated fine-tuning across various reasoning and NLP tasks.

Authors:Zewei Zhang, Huan Liu, Yuanhao Yu, Jun Chen, Xiangyu Xu
Title: Boolean Satisfiability via Imitation Learning
Abstract:
We propose ImitSAT, a branching policy for conflict-driven clause learning (CDCL) solvers based on imitation learning for the Boolean satisfiability problem (SAT). Unlike previous methods that predict instance-level signals to improve CDCL branching indirectly, or rely on reinforcement learning and insufficient CDCL information to enhance branching, ImitSAT learns from expert KeyTrace that collapses a full run into the sequence of surviving decisions. Replaying a KeyTrace on the same instance is nearly conflict-free, providing dense decision-level supervision and directly reducing propagations -- the dominant contributor to wall-clock time. This prefix-conditioned supervision enables ImitSAT to reproduce high-quality branches without exploration, yielding faster convergence, stable training, and seamless integration into CDCL. Extensive experiments demonstrate that ImitSAT reduces propagation counts and runtime, outperforming state-of-the-art learned approaches. We released the source code and trained model at https://github.com/zewei-Zhang/ImitSAT
中文: ImitSAT是一种基于模仿学习的新型CDCL求解器分支策略,通过专家KeyTrace提供密集的决策级监督,直接减少传播次数和运行时间,性能优于现有最优学习方法。
English: ImitSAT is a novel branching policy for CDCL SAT solvers that uses imitation learning from expert KeyTraces to provide dense decision-level supervision, directly reducing propagations and runtime while outperforming state-of-the-art methods.

Authors:S. Sandra Bae, Takanori Fujiwara, Danielle Albers Szafir, Ellen Yi-Luen Do, Michael L. Rivera
Title: Computational Design and Single-Wire Sensing of 3D Printed Objects with Integrated Capacitive Touchpoints
Abstract:
Producing interactive 3D printed objects currently requires laborious 3D design and post-instrumentation with off-the-shelf electronics. Multi-material 3D printing using conductive PLA presents opportunities to mitigate these challenges. We present a computational design pipeline that embeds multiple capacitive touchpoints into any 3D model that has a closed mesh without self-intersection. With our pipeline, users define touchpoints on the 3D object's surface to indicate interactive regions. Our pipeline then automatically generates a conductive path to connect the touch regions. This path is optimized to output unique resistor-capacitor delays when each region is touched, resulting in all regions being able to be sensed through a double-wire or single-wire connection. We illustrate our approach's utility with five computational and sensing performance evaluations (achieving 93.35% mean accuracy for single-wire) and six application examples. Our sensing technique supports existing uses (e.g., prototyping) and highlights the growing promise to produce interactive devices entirely with 3D printing. Project website: https://github.com/d-rep-lab/3dp-singlewire-sensing
中文: 本研究提出一种计算设计流程,可在三维模型中嵌入电容式触摸点,通过优化导电路径和单线高精度传感实现交互功能,推动了全3D打印交互设备的创新发展。
English: This study introduces a computational design pipeline that embeds capacitive touchpoints into 3D models, enabling interactive objects through optimized conductive paths and single-wire sensing with high accuracy, advancing fully 3D printed interactive devices.

Authors:Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, Jiaxuan You
Title: Where LLM Agents Fail and How They can Learn From Failures
Abstract:
Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug
中文: 大语言模型智能体因错误级联传播面临系统性失效,而提出的AgentDebug框架通过错误分类体系、基准数据集和调试工具显著提升了任务准确率,并能实现故障的迭代修复。
English: Large Language Model agents face cascading failures due to error propagation, but the proposed AgentDebug framework with its error taxonomy, benchmark dataset, and debugging tools significantly improves task accuracy and enables iterative recovery from failures.

Authors:Daniel Platnick, Mohamed E. Bengueddache, Marjan Alirezaie, Dava J. Newman, Alex ''Sandy'' Pentland, Hossein Rahnama
Title: ID-RAG: Identity Retrieval-Augmented Generation for Long-Horizon Persona Coherence in Generative Agents
Abstract:
Generative agents powered by language models are increasingly deployed for long-horizon tasks. However, as long-term memory context grows over time, they struggle to maintain coherence. This deficiency leads to critical failures, including identity drift, ignoring established beliefs, and the propagation of hallucinations in multi-agent systems. To mitigate these challenges, this paper introduces Identity Retrieval-Augmented Generation (ID-RAG), a novel mechanism designed to ground an agent's persona and persistent preferences in a dynamic, structured identity model: a knowledge graph of core beliefs, traits, and values. During the agent's decision loop, this model is queried to retrieve relevant identity context, which directly informs action selection. We demonstrate this approach by introducing and implementing a new class of ID-RAG enabled agents called Human-AI Agents (HAis), where the identity model is inspired by the Chronicle structure used in Perspective-Aware AI, a dynamic knowledge graph learned from a real-world entity's digital footprint. In social simulations of a mayoral election, HAis using ID-RAG outperformed baseline agents in long-horizon persona coherence - achieving higher identity recall across all tested models by the fourth timestep - and reduced simulation convergence time by 19% (GPT-4o) and 58% (GPT-4o mini). By treating identity as an explicit, retrievable knowledge structure, ID-RAG offers a foundational approach for developing more temporally coherent, interpretable, and aligned generative agents. Our code is open-source and available at: https://github.com/flybits/humanai-agents.
中文摘要:本文提出身份检索增强生成(ID-RAG)机制,通过动态知识图谱保持生成式智能体在长期任务中的人格一致性,显著提升身份记忆能力并缩短仿真时间。
English Summary: This paper introduces Identity Retrieval-Augmented Generation (ID-RAG), a novel mechanism that uses a dynamic knowledge graph to maintain generative agents' persona coherence during long-term tasks, significantly improving identity recall and reducing simulation time.

Authors:Jun Kawasaki
Title: ActorDB: A Unified Database Model Integrating Single-Writer Actors, Incremental View Maintenance, and Zero-Trust Messaging
Abstract:
This paper presents ActorDB ( Dekigoto ) , a novel database architecture that tightly integrates a single-writer actor model for writes, Incremental View Maintenance (IVM), and a zero-trust security model as a core component. The primary contribution of this work is the unification of these powerful but complex concepts into a single, cohesive system designed to reduce architectural complexity for developers of modern, data-intensive applications. We argue that by providing these capabilities out-of-the-box, ActorDB can offer a more robust, secure, and developer-friendly platform compared to solutions that require manual integration of separate systems for actor persistence, stream processing, and security. We present the core architecture, discuss the critical trade-offs in its design, and define the performance criteria for a Minimum Viable Product (MVP) to validate our approach.
中文: ActorDB是一种新型数据库架构,它将单写入者参与者模型、增量视图维护和零信任安全模型整合为统一系统,旨在简化数据密集型应用的开发复杂性。
English: ActorDB is a new database architecture that combines a single-writer actor model, Incremental View Maintenance, and zero-trust security into a unified system to simplify development for data-intensive applications.

Authors:Yingming Pu, Tao Lin, Hongyu Chen
Title: Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis
Abstract:
The capacity of Large Language Models (LLMs) to generate valid scientific hypotheses for materials synthesis remains largely unquantified, hindered by the absence of benchmarks probing physicochemical logics reasoning. To address this, we introduce MatterMech, a benchmark for evaluating LLM-generated hypotheses across eight nanomaterial synthesis domains. Our analysis reveals a critical disconnect: LLMs are proficient in abstract logic yet fail to ground their reasoning in fundamental physicochemical principles. We demonstrate that our proposed principle-aware prompting methodology substantially outperforms standard Chain-of-Thought, enhancing both hypothesis accuracy and computational efficiency. This work provides a methodological framework to advance LLMs toward reliable scientific hypothesis generation in materials science. The MatterMech benchmark and associated code is publicly available at \href{https://github.com/amair-lab/MatterMech}{GitHub}.
中文: 本研究推出MatterMech基准测试,揭示大语言模型在材料科学假设生成中缺乏物化原理支撑的问题,并提出一种原理感知提示方法,显著提升了假设准确性与计算效率。
English: This study introduces MatterMech, a benchmark that reveals LLMs' deficiency in grounding scientific hypotheses in physicochemical principles, and proposes a principle-aware prompting method that significantly improves hypothesis accuracy and efficiency in materials science.

Authors:Liangjian Wen, Qun Dai, Jianzhuang Liu, Jiangtao Zheng, Yong Dai, Dongkai Wang, Zhao Kang, Jun Wang, Zenglin Xu, Jiang Duan
Title: InfMasking: Unleashing Synergistic Information by Contrastive Multimodal Interactions
Abstract:
In multimodal representation learning, synergistic interactions between modalities not only provide complementary information but also create unique outcomes through specific interaction patterns that no single modality could achieve alone. Existing methods may struggle to effectively capture the full spectrum of synergistic information, leading to suboptimal performance in tasks where such interactions are critical. This is particularly problematic because synergistic information constitutes the fundamental value proposition of multimodal representation. To address this challenge, we introduce InfMasking, a contrastive synergistic information extraction method designed to enhance synergistic information through an Infinite Masking strategy. InfMasking stochastically occludes most features from each modality during fusion, preserving only partial information to create representations with varied synergistic patterns. Unmasked fused representations are then aligned with masked ones through mutual information maximization to encode comprehensive synergistic information. This infinite masking strategy enables capturing richer interactions by exposing the model to diverse partial modality combinations during training. As computing mutual information estimates with infinite masking is computationally prohibitive, we derive an InfMasking loss to approximate this calculation. Through controlled experiments, we demonstrate that InfMasking effectively enhances synergistic information between modalities. In evaluations on large-scale real-world datasets, InfMasking achieves state-of-the-art performance across seven benchmarks. Code is released at https://github.com/brightest66/InfMasking.
Chinese: InfMasking提出了一种无限掩码策略,通过在融合过程中随机遮蔽模态特征并利用互信息最大化对齐掩码表示,有效增强了模态间的协同信息,在七个基准测试中取得了最优性能。
English: InfMasking introduces an infinite masking strategy in multimodal learning that stochastically occludes modality features during fusion and aligns masked representations through mutual information maximization, achieving state-of-the-art performance across seven benchmarks by enhancing synergistic interactions.

Authors:Aayush Gupta
Title: Fact Grounded Attention: Eliminating Hallucination in Large Language Models Through Attention Level Knowledge Integration
Abstract:
"The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge." Large Language Models have conquered natural language but remain prisoners of their own probabilistic nature--confidently hallucinating facts they never truly knew. We present Fact Grounded Attention (FGA), a novel architectural modification that transforms unreliable language models into deterministic truth tellers by injecting verifiable knowledge directly into the attention mechanism. Unlike existing approaches that patch hallucinations after generation or prepend retrieved text, FGA intervenes at the mathematical heart of the transformer--the pre-softmax attention scores--creating a model that cannot hallucinate when facts exist in its knowledge base. Our experiments across 1,107 technical queries spanning smartphones, laptops, and electric vehicles demonstrate a transformation from 6.3% accuracy in vanilla Llama 3.2 to 99.7% accuracy with FGA. More critically, knowledge updates occur in under one second without retraining, compared to hours for parameter editing approaches. FGA doesn't just reduce hallucination--it eliminates it entirely for verifiable facts, marking a fundamental shift from probabilistic approximation to deterministic precision in neural language generation.
This paper introduces Fact Grounded Attention (FGA), a novel transformer modification that eliminates hallucinations by injecting verifiable knowledge directly into attention mechanisms, achieving 99.7% accuracy and enabling instant knowledge updates without retraining.
English Summary:

Authors:Kevin Xu, Issei Sato
Title: A Formal Comparison Between Chain-of-Thought and Latent Thought
Abstract:
Chain-of-Thought (CoT) elicits reasoning in large language models by explicitly generating intermediate steps in natural language. In contrast, Latent Thought in looped models operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that Latent Thought in Looped Transformers enables parallel computation, which is more efficient than the inherently sequential process of CoT. In contrast, CoT leverages stochastic decoding to approximate solutions to problems where exact computation is intractable. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms. Code is available at https://github.com/kevin671/cot-vs-loop.
中文摘要:循环变换器中的潜在思维支持高效的并行计算,而思维链则采用序列推理和随机解码处理难解问题,为选择不同推理范式提供了实用指导。
English Summary: Latent Thought in looped transformers enables efficient parallel computation, while Chain-of-Thought uses sequential reasoning with stochastic decoding for intractable problems, providing guidance for choosing between these reasoning paradigms.

Authors:Xiaojian Wang, Chaoli Zhang, Zhonglong Zheng, Yunliang Jiang
Title: WDformer: A Wavelet-based Differential Transformer Model for Time Series Forecasting
Abstract:
Time series forecasting has various applications, such as meteorological rainfall prediction, traffic flow analysis, financial forecasting, and operational load monitoring for various systems. Due to the sparsity of time series data, relying solely on time-domain or frequency-domain modeling limits the model's ability to fully leverage multi-domain information. Moreover, when applied to time series forecasting tasks, traditional attention mechanisms tend to over-focus on irrelevant historical information, which may introduce noise into the prediction process, leading to biased results. We proposed WDformer, a wavelet-based differential Transformer model. This study employs the wavelet transform to conduct a multi-resolution analysis of time series data. By leveraging the advantages of joint representation in the time-frequency domain, it accurately extracts the key information components that reflect the essential characteristics of the data. Furthermore, we apply attention mechanisms on inverted dimensions, allowing the attention mechanism to capture relationships between multiple variables. When performing attention calculations, we introduced the differential attention mechanism, which computes the attention score by taking the difference between two separate softmax attention matrices. This approach enables the model to focus more on important information and reduce noise. WDformer has achieved state-of-the-art (SOTA) results on multiple challenging real-world datasets, demonstrating its accuracy and effectiveness. Code is available at https://github.com/xiaowangbc/WDformer.
中文:提出的WDformer模型采用小波变换进行多分辨率时频分析,并引入差分注意力机制以更好地提取关键信息并降低噪声,在多个真实数据集上实现了最优性能。
English: The proposed WDformer model utilizes wavelet transform for multi-resolution time-frequency analysis and introduces a differential attention mechanism to better capture key information while reducing noise, achieving state-of-the-art performance across multiple real-world datasets.

Authors:Long Xu, Yongcai Chen, Fengshuo Liu, Yuzhong Peng
Title: MSCoD: An Enhanced Bayesian Updating Framework with Multi-Scale Information Bottleneck and Cooperative Attention for Structure-Based Drug Design
Abstract:
Structure-Based Drug Design (SBDD) is a powerful strategy in computational drug discovery, utilizing three-dimensional protein structures to guide the design of molecules with improved binding affinity. However, capturing complex protein-ligand interactions across multiple scales remains challenging, as current methods often overlook the hierarchical organization and intrinsic asymmetry of these interactions. To address these limitations, we propose MSCoD, a novel Bayesian updating-based generative framework for structure-based drug design. In our MSCoD, Multi-Scale Information Bottleneck (MSIB) was developed, which enables semantic compression at multiple abstraction levels for efficient hierarchical feature extraction. Furthermore, a multi-head cooperative attention (MHCA) mechanism was developed, which employs asymmetric protein-to-ligand attention to capture diverse interaction types while addressing the dimensionality disparity between proteins and ligands. Empirical studies showed that MSCoD outperforms state-of-the-art methods on the benchmark dataset. Case studies on challenging targets such as KRAS G12D further demonstrate its applicability in real-world scenarios. The code and data underlying this article are freely available at https://github.com/xulong0826/MSCoD.
中文摘要:研究者提出MSCoD这一基于贝叶斯更新的生成框架,通过多尺度特征提取和不对称注意力机制改进基于结构的药物设计方法,在基准测试和实际案例中均展现出优于现有技术的性能。
English Summary: The authors introduce MSCoD, a Bayesian generative framework that enhances structure-based drug design by employing multi-scale feature extraction and asymmetric attention mechanisms to better model complex protein-ligand interactions, demonstrating superior performance over existing methods.

Authors:Guillermo Comesaña Cimadevila
Title: Evaluating Double Descent in Machine Learning: Insights from Tree-Based Models Applied to a Genomic Prediction Task
Abstract:
Classical learning theory describes a well-characterised U-shaped relationship between model complexity and prediction error, reflecting a transition from underfitting in underparameterised regimes to overfitting as complexity grows. Recent work, however, has introduced the notion of a second descent in test error beyond the interpolation threshold-giving rise to the so-called double descent phenomenon. While double descent has been studied extensively in the context of deep learning, it has also been reported in simpler models, including decision trees and gradient boosting. In this work, we revisit these claims through the lens of classical machine learning applied to a biological classification task: predicting isoniazid resistance in Mycobacterium tuberculosis using whole-genome sequencing data. We systematically vary model complexity along two orthogonal axes-learner capacity (e.g., Pleaf, Pboost) and ensemble size (i.e., Pens)-and show that double descent consistently emerges only when complexity is scaled jointly across these axes. When either axis is held fixed, generalisation behaviour reverts to classical U- or L-shaped patterns. These results are replicated on a synthetic benchmark and support the unfolding hypothesis, which attributes double descent to the projection of distinct generalisation regimes onto a single complexity axis. Our findings underscore the importance of treating model complexity as a multidimensional construct when analysing generalisation behaviour. All code and reproducibility materials are available at: https://github.com/guillermocomesanacimadevila/Demystifying-Double-Descent-in-ML.
中文: 本研究表明机器学习中的双重下降现象仅在同时调节学习器容量和集成规模时出现,当任一维度固定时泛化行为会恢复为经典U型或L型模式,揭示了模型复杂度的多维本质。
English: This study demonstrates that the double descent phenomenon in machine learning emerges only when model complexity is scaled jointly across learner capacity and ensemble size, reverting to classical U- or L-shaped patterns when either dimension is held constant, highlighting the multidimensional nature of model complexity.

Authors:Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Yujiu Yang, Rui Wang
Title: PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images
Abstract:
Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however, existing cue-based methods are constrained with low-fidelity image processing and linear, rigid reasoning patterns, limiting their effectiveness on complex structured-image tasks. In this paper, we propose PixelCraft, a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. The system comprises a dispatcher, a planner, a reasoner, critics, and a set of visual tool agents. To achieve high-fidelity processing, we construct a high-quality corpus and fine-tune an MLLM into a grounding model, whose pixel-level localizations are integrated with traditional computer vision (CV) algorithms in tool agents. Building on this foundation, PixelCraft facilitates flexible visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism. Moreover, unlike prior linear reasoning patterns that simply append historical images, PixelCraft maintains an image memory to allow the planner to adaptively revisit earlier visual steps, explore alternative reasoning branches, and dynamically adjust the reasoning trajectory during discussion. Extensive experiments on challenging chart and geometry benchmarks demonstrate that PixelCraft significantly improves visual reasoning performance for advanced MLLMs, setting a new standard for structured image reasoning. Our code will be available at https://github.com/microsoft/PixelCraft.
中文: PixelCraft是一种创新的多智能体系统,通过结合高保真像素级定位与计算机视觉算法,并采用动态工作流程实现灵活自适应的推理,显著提升了结构化图像上的视觉推理性能。
English: PixelCraft is a novel multi-agent system that enhances visual reasoning on structured images by integrating high-fidelity pixel-level localization with computer vision algorithms and employing a dynamic workflow for flexible, adaptive reasoning, significantly outperforming existing methods on complex tasks.

Authors:Junyu Chen, Wenkun He, Yuchao Gu, Yuyang Zhao, Jincheng Yu, Junsong Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Muyang Li, Haocheng Xi, Ligeng Zhu, Enze Xie, Song Han, Han Cai
Title: DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
Abstract:
We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU. Code: https://github.com/dc-ai-projects/DC-VideoGen.
中文: DC-VideoGen是一种后训练加速框架,通过轻量级微调将预训练模型适配到深度压缩的潜空间,显著提升视频生成效率,推理延迟降低高达14.8倍且不损失质量。
English: DC-VideoGen is a post-training acceleration framework that enhances video generation efficiency by adapting pre-trained models to a compressed latent space through lightweight fine-tuning, achieving up to 14.8x faster inference without quality loss.

Authors:Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai
Title: DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space
Abstract:
Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model's latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. We verify DC-Gen's effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.
中文:DC-Gen通过轻量级嵌入对齐和少量LoRA微调,利用深度压缩的潜在空间加速文本到图像扩散模型,在保持与基础模型相当图像质量的同时实现显著加速。
English: DC-Gen accelerates text-to-image diffusion models by leveraging a deeply compressed latent space through lightweight embedding alignment and minimal LoRA fine-tuning, achieving significant speedups while maintaining image quality comparable to base models.

Authors:Bingkui Tong, Jiaer Xia, Kaiyang Zhou
Title: Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding
Abstract:
Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations -- generating outputs that are linguistically coherent but inconsistent with the context of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate as they only capture biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms current state-of-the-art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD .
Chinese: 多模态大语言模型常出现幻觉问题,而提出的层对比解码方法通过对比浅层与深层视觉特征,有效减少幻觉,显著提升了输出的准确性。
English: Multimodal Large Language Models often produce hallucinations, but the proposed Layer Contrastive Decoding method effectively mitigates this by contrasting shallow and deep visual features to enhance output accuracy.

Authors:Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang, Weiming Lu, Yueting Zhuang, Yongliang Shen
Title: EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
Abstract:
Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time through targeted manipulation of hidden states, offering a lightweight alternative to expensive retraining. However, existing steering frameworks suffer from critical limitations: computational inefficiency, limited extensibility, and restricted functionality that hinder both research progress and practical deployment. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM. Our system features modular architecture with pluggable interfaces for both analysis-based and learning-based methods, fine-grained parameter control, pre-computed steering vectors for eight application domains, and an interactive demonstration system. Through deep integration with vLLM's optimized inference engine, EasySteer achieves 5.5-11.4$\times$ speedup over existing frameworks. Extensive experiments demonstrate its effectiveness in overthinking mitigation, hallucination reduction, and other key applications. EasySteer transforms steering from research technique to production-ready capability, establishing critical infrastructure for deployable, controllable language models.
中文摘要:EasySteer是一个基于vLLM构建的高性能、可扩展大语言模型引导框架,通过与优化推理引擎深度集成实现显著加速,并具备模块化架构和多领域应用能力。
English Summary: EasySteer is a high-performance, extensible framework for LLM steering that achieves significant speed improvements and broad application effectiveness through deep integration with vLLM.

Authors:Fan Yuan, Yuchen Yan, Yifan Jiang, Haoran Zhao, Tao Feng, Jinyan Chen, Yanwei Lou, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: GSM8K-V: Can Vision Language Models Solve Grade School Math Word Problems in Visual Contexts
Abstract:
Vision language models (VLMs) achieve unified modeling of images and text, enabling them to accomplish complex real-world tasks through perception, planning, and reasoning. Among these tasks, reasoning is particularly representative, with mathematical reasoning serving as a prominent example. It highlights the high-level capability of VLMs to comprehend mathematical information in images and to perform sophisticated reasoning. Recently, numerous visual mathematical reasoning benchmarks have been proposed, but they are often restricted to geometry, lack coverage of math word problems, and rarely assess reasoning across multiple images. To address these gaps, we introduce GSM8K-V, a purely visual multi-image mathematical reasoning benchmark. GSM8K-V is built by systematically mapping each sample from the widely used text-based GSM8K into visual form. Through a carefully designed automated image-generation pipeline combined with meticulous human annotation, we curate 1,319 high-quality samples. We evaluate a wide range of open-source and closed-source models on GSM8K-V. Results show that although existing VLMs have nearly saturated performance on text-based GSM8K, there remains substantial room for improvement on GSM8K-V. For example, the best-performing model, Gemini-2.5-Pro, achieves 95.22% accuracy on GSM8K but only 46.93% on GSM8K-V. We conduct a comprehensive analysis of GSM8K-V, examining the limitations of current models as well as potential directions for improvement. GSM8K-V offers a new perspective on visual mathematical reasoning and establishes a benchmark to guide the development of more robust and generalizable VLMs.
中文: GSM8K-V作为一个纯视觉多图像数学推理基准被提出,旨在弥补现有基准的不足,揭示了当前视觉语言模型虽然在文本数学推理上表现优异,但在视觉数学推理方面仍有巨大提升空间。
English: GSM8K-V is introduced as a purely visual multi-image mathematical reasoning benchmark to address gaps in existing benchmarks, revealing significant performance disparities between text-based and visual mathematical reasoning in current vision language models despite their advanced capabilities.

Authors:Zhaozhi Wang, Tong Zhang, Mingyue Guo, Yaowei Wang, Qixiang Ye
Title: VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language alignment, yet they remain limited in visual-spatial reasoning. We first identify that this limitation arises from the attention mechanism: visual tokens are overshadowed by language tokens, preventing the model from consistently recognizing the same visual cues across frames. To address this challenge, we draw a novel connection between the self-expressiveness property in sparse subspace clustering and the attention mechanism in Transformers. Building on this insight, we propose VideoAnchor, a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining, effectively anchoring attention to shared visual structures. Extensive experiments across benchmarks and backbone models show consistent performance gains -- $e.g.$, 3.2% and 4.6% improvements on VSI-Bench and Video-MME (spatial-related tasks) with InternVL2-8B and Qwen2.5VL-72B -- while qualitative analyses demonstrate more coherent subspace partitions and stronger visual grounding. Our codes will be made public available at https://github.com/feufhd/VideoAnchor.
中文: 多模态大语言模型因语言标记主导注意力而存在视觉空间推理局限,但提出的VideoAnchor模块利用子空间聚类原理增强跨帧视觉线索且无需重新训练,在空间任务上实现了显著性能提升。
English: Multimodal Large Language Models struggle with visual-spatial reasoning due to language tokens dominating attention, but the proposed VideoAnchor module leverages subspace clustering principles to reinforce visual cues across frames without retraining, achieving significant performance improvements on spatial tasks.

Authors:Tomoyuki Suzuki, Kang-Jun Liu, Naoto Inoue, Kota Yamaguchi
Title: LayerD: Decomposing Raster Graphic Designs into Layers
Abstract:
Designers craft and edit graphic designs in a layer representation, but layer-based editing becomes impossible once composited into a raster image. In this work, we propose LayerD, a method to decompose raster graphic designs into layers for re-editable creative workflow. LayerD addresses the decomposition task by iteratively extracting unoccluded foreground layers. We propose a simple yet effective refinement approach taking advantage of the assumption that layers often exhibit uniform appearance in graphic designs. As decomposition is ill-posed and the ground-truth layer structure may not be reliable, we develop a quality metric that addresses the difficulty. In experiments, we show that LayerD successfully achieves high-quality decomposition and outperforms baselines. We also demonstrate the use of LayerD with state-of-the-art image generators and layer-based editing.
中文: LayerD是一种创新方法,通过迭代提取未遮挡前景层并基于图层均匀外观假设进行优化,将栅格图形设计分解为可编辑图层,在实验中优于基线方法,并能与先进图像生成器协同实现可重复编辑的工作流程。
English: LayerD is a novel method that decomposes raster graphic designs into editable layers by iteratively extracting unoccluded foregrounds and refining them based on uniform appearance assumptions, outperforming baselines and enabling re-editable workflows with modern image generators.

Authors:Chengyao Wang, Zhisheng Zhong, Bohao Peng, Senqiao Yang, Yuqi Liu, Haokun Gui, Bin Xia, Jingyao Li, Bei Yu, Jiaya Jia
Title: MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Abstract:
We present MGM-Omni, a unified Omni LLM for omni-modal understanding and expressive, long-horizon speech generation. Unlike cascaded pipelines that isolate speech synthesis, MGM-Omni adopts a "brain-mouth" design with a dual-track, token-based architecture that cleanly decouples multimodal reasoning from real-time speech generation. This design enables efficient cross-modal interaction and low-latency, streaming speech generation. For understanding, a unified training strategy coupled with a dual audio encoder design enables long-form audio perception across diverse acoustic conditions. For generation, a chunk-based parallel decoding scheme narrows the text speech token-rate gap, accelerating inference and supporting streaming zero-shot voice cloning with stable timbre over extended durations. Compared to concurrent work, MGM-Omni achieves these capabilities with markedly data-efficient training. Extensive experiments demonstrate that MGM-Omni outperforms existing open source models in preserving timbre identity across extended sequences, producing natural and context-aware speech, and achieving superior long-form audio and omnimodal understanding. MGM-Omni establishes an efficient, end-to-end paradigm for omnimodal understanding and controllable, personalised long-horizon speech generation.
Chinese: MGM-Omni 是一种统一的Omni LLM,通过双轨架构高效处理多模态理解和富有表现力的长时程语音生成,实现了低延迟流式处理,并在数据高效训练的基础上,在各项任务中展现出卓越性能。
English: MGM-Omni is a unified Omni LLM that efficiently handles multimodal understanding and expressive, long-horizon speech generation through a dual-track architecture, enabling low-latency streaming and superior performance across diverse tasks with data-efficient training.

Authors:M A Al-Masud, Juan Miguel Lopez Alcaraz, Nils Strodthoff
Title: Benchmarking ECG Foundational Models: A Reality Check Across Clinical Tasks
Abstract:
The 12-lead electrocardiogram (ECG) is a long-standing diagnostic tool. Yet machine learning for ECG interpretation remains fragmented, often limited to narrow tasks or datasets. Foundation models promise broader adaptability, but their generalization across diverse ECG tasks is not well understood. We benchmarked eight ECG foundation models on 26 clinically relevant tasks using 12 public datasets comprising 1,650 regression and classification targets. Models were evaluated under fine-tuning and frozen settings, with scaling analyses across dataset sizes. Results show heterogeneous performance across domains: in the most widely studied domain, adult ECG interpretation, three foundation models consistently outperformed strong supervised baselines. In contrast, ECG-CPC, a compact structured state-space model pretrained on HEEDB, dominated other categories where most foundation models failed to surpass supervised learning. Foundation models also displayed distinct scaling behaviors with dataset size, which are critical for small-scale clinical applications. Overall, while foundation models show promise for adult ECG analysis, substantial gaps remain in cardiac structure, outcome prediction, and patient characterization. Notably, ECG-CPC's strong performance despite being orders of magnitude smaller and consuming minimal computational resources highlights untapped opportunities for advancing ECG foundation models.
中文摘要:本研究评估了八种心电图基础模型在26项临床任务中的表现,发现这些模型在成人心电图解读方面展现出潜力,但在心脏结构和预后预测方面仍存在明显不足,其中ECG-CPC模型虽结构紧凑却表现出卓越效能。
English Summary: This study evaluated eight ECG foundation models across 26 clinical tasks, finding they show promise for adult ECG interpretation but exhibit significant performance gaps in cardiac structure and outcome prediction, with ECG-CPC emerging as a particularly efficient model despite its compact size.

Authors:AmirHossein Zamani, Bruno Roy, Arianna Rampini
Title: Unsupervised Representation Learning for 3D Mesh Parameterization with Semantic and Visibility Objectives
Abstract:
Recent 3D generative models produce high-quality textures for 3D mesh objects. However, they commonly rely on the heavy assumption that input 3D meshes are accompanied by manual mesh parameterization (UV mapping), a manual task that requires both technical precision and artistic judgment. Industry surveys show that this process often accounts for a significant share of asset creation, creating a major bottleneck for 3D content creators. Moreover, existing automatic methods often ignore two perceptually important criteria: (1) semantic awareness (UV charts should align semantically similar 3D parts across shapes) and (2) visibility awareness (cutting seams should lie in regions unlikely to be seen). To overcome these shortcomings and to automate the mesh parameterization process, we present an unsupervised differentiable framework that augments standard geometry-preserving UV learning with semantic- and visibility-aware objectives. For semantic-awareness, our pipeline (i) segments the mesh into semantic 3D parts, (ii) applies an unsupervised learned per-part UV-parameterization backbone, and (iii) aggregates per-part charts into a unified UV atlas. For visibility-awareness, we use ambient occlusion (AO) as an exposure proxy and back-propagate a soft differentiable AO-weighted seam objective to steer cutting seams toward occluded regions. By conducting qualitative and quantitative evaluations against state-of-the-art methods, we show that the proposed method produces UV atlases that better support texture generation and reduce perceptible seam artifacts compared to recent baselines. Our implementation code is publicly available at: https://github.com/AHHHZ975/Semantic-Visibility-UV-Param.
中文: 近期3D生成模型依赖手动UV映射这一耗时环节,我们提出的无监督框架通过融入语义感知和可见性感知机制,自动生成优化后的UV图谱,显著提升纹理生成质量并减少接缝瑕疵。
English: Recent 3D generative models require manual UV mapping, a bottleneck in asset creation, but our unsupervised framework automates this process by incorporating semantic and visibility awareness to produce superior UV atlases that enhance texture generation and minimize seam visibility.

Authors:Bogdan Raonić, Siddhartha Mishra, Samuel Lanthaler
Title: Towards a Certificate of Trust: Task-Aware OOD Detection for Scientific AI
Abstract:
Data-driven models are increasingly adopted in critical scientific fields like weather forecasting and fluid dynamics. These methods can fail on out-of-distribution (OOD) data, but detecting such failures in regression tasks is an open challenge. We propose a new OOD detection method based on estimating joint likelihoods using a score-based diffusion model. This approach considers not just the input but also the regression model's prediction, providing a task-aware reliability score. Across numerous scientific datasets, including PDE datasets, satellite imagery and brain tumor segmentation, we show that this likelihood strongly correlates with prediction error. Our work provides a foundational step towards building a verifiable 'certificate of trust', thereby offering a practical tool for assessing the trustworthiness of AI-based scientific predictions. Our code is publicly available at https://github.com/bogdanraonic3/OOD_Detection_ScientificML
Chinese: 本文提出了一种基于分数扩散模型估计联合似然的新颖分布外检测方法,通过提供与多种科学数据集预测误差相关的任务感知可靠性评分,推进了可信人工智能预测的发展。
English: This paper introduces a novel out-of-distribution detection method using score-based diffusion models to estimate joint likelihoods, providing a task-aware reliability score that correlates with prediction errors across diverse scientific datasets and advancing trustworthy AI predictions.

Authors:Huaizhi Qu, Xiao Wang, Gengwei Zhang, Jie Peng, Tianlong Chen
Title: GEM: 3D Gaussian Splatting for Efficient and Accurate Cryo-EM Reconstruction
Abstract:
Cryo-electron microscopy (cryo-EM) has become a central tool for high-resolution structural biology, yet the massive scale of datasets (often exceeding 100k particle images) renders 3D reconstruction both computationally expensive and memory intensive. Traditional Fourier-space methods are efficient but lose fidelity due to repeated transforms, while recent real-space approaches based on neural radiance fields (NeRFs) improve accuracy but incur cubic memory and computation overhead. Therefore, we introduce GEM, a novel cryo-EM reconstruction framework built on 3D Gaussian Splatting (3DGS) that operates directly in real-space while maintaining high efficiency. Instead of modeling the entire density volume, GEM represents proteins with compact 3D Gaussians, each parameterized by only 11 values. To further improve the training efficiency, we designed a novel gradient computation to 3D Gaussians that contribute to each voxel. This design substantially reduced both memory footprint and training cost. On standard cryo-EM benchmarks, GEM achieves up to 48% faster training and 12% lower memory usage compared to state-of-the-art methods, while improving local resolution by as much as 38.8%. These results establish GEM as a practical and scalable paradigm for cryo-EM reconstruction, unifying speed, efficiency, and high-resolution accuracy. Our code is available at https://github.com/UNITES-Lab/GEM.
中文: GEM提出了一种基于3D高斯点渲染的新型冷冻电镜重建框架,相比现有方法显著降低了内存占用和训练成本,同时提高了分辨率。
English: GEM introduces a novel cryo-EM reconstruction framework using 3D Gaussian Splatting, which significantly reduces memory usage and training costs while improving resolution compared to existing methods.

Authors:Kenny Truong, Yongkyu Lee, Jason Irie, Shivam Kumar Panda, Mohammad Jony, Shahab Ahmad, Md. Mukhlesur Rahman, M. Khalid Jawed
Title: AgriCruiser: An Open Source Agriculture Robot for Over-the-row Navigation
Abstract:
We present the AgriCruiser, an open-source over-the-row agricultural robot developed for low-cost deployment and rapid adaptation across diverse crops and row layouts. The chassis provides an adjustable track width of 1.42 m to 1.57 m, along with a ground clearance of 0.94 m. The AgriCruiser achieves compact pivot turns with radii of 0.71 m to 0.79 m, enabling efficient headland maneuvers. The platform is designed for the integration of the other subsystems, and in this study, a precision spraying system was implemented to assess its effectiveness in weed management. In twelve flax plots, a single robotic spray pass reduced total weed populations (pigweed and Venice mallow) by 24- to 42-fold compared to manual weeding in four flax plots, while also causing less crop damage. Mobility experiments conducted on concrete, asphalt, gravel, grass, and both wet and dry soil confirmed reliable traversal consistent with torque sizing. The complete chassis can be constructed from commodity T-slot extrusion with minimal machining, resulting in a bill of materials costing approximately $5,000 - $6,000, which enables replication and customization. The mentioned results demonstrate that low-cost, reconfigurable over-the-row robots can achieve effective weed management with reduced crop damage and labor requirements, while providing a versatile foundation for phenotyping, sensing, and other agriculture applications. Design files and implementation details are released to accelerate research and adoption of modular agricultural robotics.
中文:AgriCruiser是一款低成本开源农业机器人,可适应不同作物和行距,其精准喷洒系统能高效除草并减少作物损伤,为农业应用提供了灵活基础。
English: The AgriCruiser is a low-cost, open-source agricultural robot designed for adaptable use across various crops, featuring adjustable track width and effective precision spraying that significantly reduces weeds with minimal crop damage.

Authors:Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma
Title: Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
Abstract:
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objectives--score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce \textbf{Advantage Weighted Matching (AWM)}, a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
中文摘要:本文提出优势加权匹配方法,通过统一扩散模型的预训练与强化学习目标来降低方差并加速收敛,在多个基准测试中实现高达24倍的训练加速且不损失生成质量。
English Summary: This paper introduces Advantage Weighted Matching (AWM), a reinforcement learning method for diffusion models that aligns training objectives with pretraining to reduce variance and accelerate convergence, achieving up to 24x speedup over prior methods while maintaining output quality.

Authors:Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin
Title: VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Abstract:
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
Chinese: VT-FSL框架通过跨模态迭代提示和几何对齐,利用大语言模型生成精确的类别描述和合成图像,以增强小样本学习,在多个基准测试中取得了最先进的性能。
English: The VT-FSL framework introduces cross-modal iterative prompting and geometric alignment to enhance few-shot learning by generating precise class descriptions and synthetic images using LLMs, achieving state-of-the-art results across multiple benchmarks.

Authors:Xiaoxiao Ma, Haibo Qiu, Guohui Zhang, Zhixiong Zeng, Siqi Yang, Lin Ma, Feng Zhao
Title: STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
Abstract:
Reinforcement learning has recently been explored to improve text-to-image generation, yet applying existing GRPO algorithms to autoregressive (AR) image models remains challenging. The instability of the training process easily disrupts the pretrained model capability during long runs, resulting in marginal gains, degraded image quality, and poor generalization. In this work, we revisit GRPO for AR image generation and identify two key issues: contradictory gradients from unnecessary tokens and unstable policy entropy dynamics. To address these, we introduce STAGE, a stable and generalizable framework that leverages two targeted solutions: 1) Advantage/KL reweighting. Similarity-aware reweighting to alleviate conflicting updates; and 2) Entropy reward. An entropy-based reward corresponding to reference model to stabilize learning. With the help of alleviating conflicts between tokens and an entropy reward for stabilizing training, we reduce disruption of the pretrained distribution and mitigate reward hacking, which in turn improves generalization and transfer better to other benchmarks. Experiments across multiple benchmarks show that STAGE consistently improves visual quality, stability, and cross-task generalization compared to baseline GRPO.
中文:STAGE框架通过解决自回归图像生成中令牌梯度冲突和稳定熵动态,有效提升了强化学习的训练稳定性,在多个基准测试中实现了更好的图像质量和泛化能力。
English: The STAGE framework addresses instability in reinforcement learning for autoregressive image generation by resolving contradictory token gradients and stabilizing entropy dynamics, leading to improved image quality and generalization across benchmarks.

Authors:Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
Title: Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns
Abstract:
Generating accurate and calibrated confidence estimates is critical for deploying LLMs in high-stakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence as a problem of eliciting a model's "self-knowledge", i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that there is some privileged information about the answer's correctness that is accessible to the model itself. However, our experiments reveal that an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated LLM. Moreover, we hypothesize that a key factor in building a "Correctness Model" (CM) is exposure to a target model's historical predictions. We propose multiple methods to inject this historical correctness information, creating a Generalized Correctness Model (GCM). We first show that GCMs can be trained on the correctness data from many LLMs and learn patterns for correctness prediction applicable across datasets and models. We then use CMs as a lens for studying the source of correctness prediction ability and its generalization, systematically controlling their training data and finding that answer phrasing is a strong predictor for correctness. We further explore alternative methods of injecting history without training an LLM, finding that including history as in-context examples can help improve correctness prediction, and post-hoc calibration can provide complementary reductions in calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable and model-agnostic skill learned by systematically encoding correctness history rather than a model-specific skill reliant on self-introspection.
中文: 可靠的LLM置信度估计是一种通过系统编码正确性历史而非依赖模型自省习得的可泛化技能,实验表明无关模型预测答案正确性的能力与模型自身相当。
English: Accurate confidence estimation for LLMs is a generalizable skill achieved by systematically encoding correctness history rather than relying on model self-introspection, as demonstrated through experiments showing that unrelated models can predict correctness as effectively as the model itself.

Authors:Lekang Yang, Yuetong Liu, Yitong Zhang, Jia Li
Title: DiffTester: Accelerating Unit Test Generation for Diffusion LLMs via Repetitive Pattern
Abstract:
Software development relies heavily on extensive unit testing, which makes the efficiency of automated Unit Test Generation (UTG) particularly important. However, most existing LLMs generate test cases one token at a time in each forward pass, which leads to inefficient UTG. Recently, diffusion LLMs (dLLMs) have emerged, offering promising parallel generation capabilities and showing strong potential for efficient UTG. Despite this advantage, their application to UTG is still constrained by a clear trade-off between efficiency and test quality, since increasing the number of tokens generated in each step often causes a sharp decline in the quality of test cases. To overcome this limitation, we present DiffTester, an acceleration framework specifically tailored for dLLMs in UTG. The key idea of DiffTester is that unit tests targeting the same focal method often share repetitive structural patterns. By dynamically identifying these common patterns through abstract syntax tree analysis during generation, DiffTester adaptively increases the number of tokens produced at each step without compromising the quality of the output. To enable comprehensive evaluation, we extend the original TestEval benchmark, which was limited to Python, by introducing additional programming languages including Java and C++. Extensive experiments on three benchmarks with two representative models show that DiffTester delivers significant acceleration while preserving test coverage. Moreover, DiffTester generalizes well across different dLLMs and programming languages, providing a practical and scalable solution for efficient UTG in software development. Code and data are publicly available at https://github.com/wellbeingyang/DLM4UTG-open .
中文摘要:DiffTester是一个针对扩散大模型的加速框架,通过动态识别重复结构模式实现并行令牌生成,在保持测试质量和覆盖率的同时显著提升单元测试生成效率,并能跨多种编程语言有效应用。
English Summary: DiffTester is an acceleration framework that enhances the efficiency of diffusion-based language models for unit test generation by dynamically identifying repetitive structural patterns, enabling faster parallel token generation without sacrificing test quality or coverage across multiple programming languages.

Authors:Kaihong Li, Huichi Zhou, Bin Ma, Fangjun Huang
Title: SemanticShield: LLM-Powered Audits Expose Shilling Attacks in Recommender Systems
Abstract:
Recommender systems (RS) are widely used in e-commerce for personalized suggestions, yet their openness makes them susceptible to shilling attacks, where adversaries inject fake behaviors to manipulate recommendations. Most existing defenses emphasize user-side behaviors while overlooking item-side features such as titles and descriptions that can expose malicious intent. To address this gap, we propose a two-stage detection framework that integrates item-side semantics via large language models (LLMs). The first stage pre-screens suspicious users using low-cost behavioral criteria, and the second stage employs LLM-based auditing to evaluate semantic consistency. Furthermore, we enhance the auditing model through reinforcement fine-tuning on a lightweight LLM with carefully designed reward functions, yielding a specialized detector called SemanticShield. Experiments on six representative attack strategies demonstrate the effectiveness of SemanticShield against shilling attacks, and further evaluation on previously unseen attack methods shows its strong generalization capability. Code is available at https://github.com/FrankenstLee/SemanticShield.
中文摘要:本文提出SemanticShield框架,通过整合大语言模型分析商品端语义特征来检测推荐系统中的托攻击,实验证明该方法在不同攻击策略下均有效且具备良好泛化能力。
English Summary: This paper introduces SemanticShield, a two-stage framework that leverages large language models to detect shilling attacks in recommender systems by analyzing item-side semantic features, demonstrating strong performance and generalization across various attack strategies.

Authors:Angxiao Yue, Anqi Dong, Hongteng Xu
Title: OAT-FM: Optimal Acceleration Transport for Improved Flow Matching
Abstract:
As a powerful technique in generative modeling, Flow Matching (FM) aims to learn velocity fields from noise to data, which is often explained and implemented as solving Optimal Transport (OT) problems. In this study, we bridge FM and the recent theory of Optimal Acceleration Transport (OAT), developing an improved FM method called OAT-FM and exploring its benefits in both theory and practice. In particular, we demonstrate that the straightening objective hidden in existing OT-based FM methods is mathematically equivalent to minimizing the physical action associated with acceleration defined by OAT. Accordingly, instead of enforcing constant velocity, OAT-FM optimizes the acceleration transport in the product space of sample and velocity, whose objective corresponds to a necessary and sufficient condition of flow straightness. An efficient algorithm is designed to achieve OAT-FM with low complexity. OAT-FM motivates a new two-phase FM paradigm: Given a generative model trained by an arbitrary FM method, whose velocity information has been relatively reliable, we can fine-tune and improve it via OAT-FM. This paradigm eliminates the risk of data distribution drift and the need to generate a large number of noise data pairs, which consistently improves model performance in various generative tasks. Code is available at: https://github.com/AngxiaoYue/OAT-FM
中文: 本研究将流匹配与最优加速传输理论相结合,提出了OAT-FM方法,通过优化加速度传输实现更直的流形轨迹,并采用两阶段微调策略提升生成模型的性能。
English: Flow Matching (FM) is enhanced by integrating it with Optimal Acceleration Transport (OAT), resulting in OAT-FM, which optimizes acceleration transport for straighter flows and improved generative model performance through a two-phase fine-tuning approach.

Authors:Tian Xia, Matthew Sinclair, Andreas Schuh, Fabio De Sousa Ribeiro, Raghav Mehta, Rajat Rasal, Esther Puyol-Antón, Samuel Gerber, Kersten Petersen, Michiel Schaap, Ben Glocker
Title: Segmentor-Guided Counterfactual Fine-Tuning for Locally Coherent and Targeted Image Synthesis
Abstract:
Counterfactual image generation is a powerful tool for augmenting training data, de-biasing datasets, and modeling disease. Current approaches rely on external classifiers or regressors to increase the effectiveness of subject-level interventions (e.g., changing the patient's age). For structure-specific interventions (e.g., changing the area of the left lung in a chest radiograph), we show that this is insufficient, and can result in undesirable global effects across the image domain. Previous work used pixel-level label maps as guidance, requiring a user to provide hypothetical segmentations which are tedious and difficult to obtain. We propose Segmentor-guided Counterfactual Fine-Tuning (Seg-CFT), which preserves the simplicity of intervening on scalar-valued, structure-specific variables while producing locally coherent and effective counterfactuals. We demonstrate the capability of generating realistic chest radiographs, and we show promising results for modeling coronary artery disease. Code: https://github.com/biomedia-mira/seg-cft.
中文: 现有的反事实图像生成方法在处理结构特异性干预时效果不足且依赖繁琐的像素级标注,而本文提出的Seg-CFT方法通过简单标量变量即可生成局部一致的反事实图像,在医学影像领域展现出良好应用前景。
English: Current counterfactual image generation methods are insufficient for structure-specific interventions and require tedious pixel-level guidance, but the proposed Seg-CFT approach effectively produces locally coherent counterfactuals using simple scalar variables while demonstrating promising results in medical imaging.

Authors:Thanh Long Nguyen, Duc Phu Nguyen, Thanh Thao Ton Nu, Quan Le, Thuan Hoang Tran, Manh Duong Phung
Title: Real-time Recognition of Human Interactions from a Single RGB-D Camera for Socially-Aware Robot Navigation
Abstract:
{Recognizing human interactions is essential for social robots as it enables them to navigate safely and naturally in shared environments. Conventional robotic systems however often focus on obstacle avoidance, neglecting social cues necessary for seamless human-robot interaction. To address this gap, we propose a framework to recognize human group interactions for socially aware navigation. Our method utilizes color and depth frames from a monocular RGB-D camera to estimate 3D human keypoints and positions. Principal component analysis (PCA) is then used to determine dominant interaction directions. The shoelace formula is finally applied to compute interest points and engagement areas. Extensive experiments have been conducted to evaluate the validity of the proposed method. The results show that our method is capable of recognizing group interactions across different scenarios with varying numbers of individuals. It also achieves high-speed performance, processing each frame in approximately 4 ms on a single-board computer used in robotic systems. The method is implemented as a ROS 2 package making it simple to integrate into existing navigation systems. Source code is available at https://github.com/thanhlong103/social-interaction-detector
Chinese: 本研究提出了一种利用RGB-D相机数据、3D关键点估计和几何分析来识别人群交互的框架,以实现社会感知的机器人导航,具备实时处理能力,并通过ROS 2软件包实现便捷集成。
English: This study introduces a framework for recognizing human group interactions using RGB-D camera data, 3D keypoint estimation, and geometric analysis to enable socially aware robot navigation, achieving real-time performance and easy integration via a ROS 2 package.

Authors:Yu Ma, Guoliang Wei, Haihong Xiao, Yue Cheng
Title: HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping
Abstract:
Novel View Synthesis (NVS) from sparse views presents a formidable challenge in 3D reconstruction, where limited multi-view constraints lead to severe overfitting, geometric distortion, and fragmented scenes. While 3D Gaussian Splatting (3DGS) delivers real-time, high-fidelity rendering, its performance drastically deteriorates under sparse inputs, plagued by floating artifacts and structural failures. To address these challenges, we introduce HBSplat, a unified framework that elevates 3DGS by seamlessly integrating robust structural cues, virtual view constraints, and occluded region completion. Our core contributions are threefold: a Hybrid-Loss Depth Estimation module that ensures multi-view consistency by leveraging dense matching priors and integrating reprojection, point propagation, and smoothness constraints; a Bidirectional Warping Virtual View Synthesis method that enforces substantially stronger constraints by creating high-fidelity virtual views through bidirectional depth-image warping and multi-view fusion; and an Occlusion-Aware Reconstruction component that recovers occluded areas using a depth-difference mask and a learning-based inpainting model. Extensive evaluations on LLFF, Blender, and DTU benchmarks validate that HBSplat sets a new state-of-the-art, achieving up to 21.13 dB PSNR and 0.189 LPIPS, while maintaining real-time inference. Code is available at: https://github.com/eternalland/HBSplat.
Chinese: HBSplat通过整合结构线索、虚拟视图合成和遮挡感知重建,提升了3D高斯溅射在稀疏视图下的表现,实现了最先进的实时渲染效果。
English: HBSplat enhances 3D Gaussian Splatting by integrating structural cues, virtual view synthesis, and occlusion-aware reconstruction to overcome sparse view challenges, achieving state-of-the-art performance and real-time rendering.

Authors:Teodor Chiaburu, Vipin Singh, Frank Haußer, Felix Bießmann
Title: Uncertainty-Guided Expert-AI Collaboration for Efficient Soil Horizon Annotation
Abstract:
Uncertainty quantification is essential in human-machine collaboration, as human agents tend to adjust their decisions based on the confidence of the machine counterpart. Reliably calibrated model uncertainties, hence, enable more effective collaboration, targeted expert intervention and more responsible usage of Machine Learning (ML) systems. Conformal prediction has become a well established model-agnostic framework for uncertainty calibration of ML models, offering statistically valid confidence estimates for both regression and classification tasks. In this work, we apply conformal prediction to $\textit{SoilNet}$, a multimodal multitask model for describing soil profiles. We design a simulated human-in-the-loop (HIL) annotation pipeline, where a limited budget for obtaining ground truth annotations from domain experts is available when model uncertainty is high. Our experiments show that conformalizing SoilNet leads to more efficient annotation in regression tasks and comparable performance scores in classification tasks under the same annotation budget when tested against its non-conformal counterpart. All code and experiments can be found in our repository: https://github.com/calgo-lab/BGR
中文: 保形预测改进了SoilNet模型的不确定性校准,在有限专家标注预算下,实现了回归任务中人机协同标注效率的提升,同时保持了分类任务的同等性能水平。
English: Conformal prediction enhances SoilNet's uncertainty calibration, enabling more efficient human-in-the-loop soil annotation in regression tasks while maintaining classification performance under limited expert budgets.

Authors:Jiayi Li, Flora D. Salim
Title: DRIFT-Net: A Spectral--Coupled Neural Operator for PDEs Learning
Abstract:
Learning PDE dynamics with neural solvers can significantly improve wall-clock efficiency and accuracy compared with classical numerical solvers. In recent years, foundation models for PDEs have largely adopted multi-scale windowed self-attention, with the scOT backbone in \textsc{Poseidon} serving as a representative example. However, because of their locality, truly globally consistent spectral coupling can only be propagated gradually through deep stacking and window shifting. This weakens global coupling and leads to error accumulation and drift during closed-loop rollouts. To address this, we propose \textbf{DRIFT-Net}. It employs a dual-branch design comprising a spectral branch and an image branch. The spectral branch is responsible for capturing global, large-scale low-frequency information, whereas the image branch focuses on local details and nonstationary structures. Specifically, we first perform controlled, lightweight mixing within the low-frequency range. Then we fuse the spectral and image paths at each layer via bandwise weighting, which avoids the width inflation and training instability caused by naive concatenation. The fused result is transformed back into the spatial domain and added to the image branch, thereby preserving both global structure and high-frequency details across scales. Compared with strong attention-based baselines, DRIFT-Net achieves lower error and higher throughput with fewer parameters under identical training settings and budget. On Navier--Stokes benchmarks, the relative $L_{1}$ error is reduced by 7\%--54\%, the parameter count decreases by about 15\%, and the throughput remains higher than scOT. Ablation studies and theoretical analyses further demonstrate the stability and effectiveness of this design. The code is available at https://github.com/cruiseresearchgroup/DRIFT-Net.
Chinese: DRIFT-Net采用双分支架构,结合频谱与图像路径来增强PDE学习中的全局耦合与局部细节保留,相比基于注意力的模型,在参数更少的情况下实现了更高的精度、效率和更低的误差率。
English: DRIFT-Net introduces a dual-branch architecture that combines spectral and image pathways to enhance global coupling and local detail preservation in PDE learning, achieving superior accuracy and efficiency over attention-based models with fewer parameters and reduced error rates.

Authors:Alexandre Queant, Ulysse Rançon, Benoit R Cottereau, Timothée Masquelier
Title: DelRec: learning delays in recurrent spiking neural networks
Abstract:
Spiking neural networks (SNNs) are a bio-inspired alternative to conventional real-valued deep learning models, with the potential for substantially higher energy efficiency. Interest in SNNs has recently exploded due to a major breakthrough: surrogate gradient learning (SGL), which allows training SNNs with backpropagation, strongly outperforming other approaches. In SNNs, each synapse is characterized not only by a weight but also by a transmission delay. While theoretical works have long suggested that trainable delays significantly enhance expressivity, practical methods for learning them have only recently emerged. Here, we introduce ''DelRec'', the first SGL-based method to train axonal or synaptic delays in recurrent spiking layers, compatible with any spiking neuron model. DelRec leverages a differentiable interpolation technique to handle non-integer delays with well-defined gradients at training time. We show that trainable recurrent delays outperform feedforward ones, leading to new state-of-the-art (SOTA) on two challenging temporal datasets (Spiking Speech Command, an audio dataset, and Permuted Sequential MNIST, a vision one), and match the SOTA on the now saturated Spiking Heidelberg Digit dataset using only vanilla Leaky-Integrate-and-Fire neurons with stateless (instantaneous) synapses. Our results demonstrate that recurrent delays are critical for temporal processing in SNNs and can be effectively optimized with DelRec, paving the way for efficient deployment on neuromorphic hardware with programmable delays. Our code is available at : https://github.com/alexmaxad/DelRec.
Chinese: DelRec首次提出基于代理梯度学习的递归脉冲神经网络延迟训练方法,在时序数据集上达到最优性能,证明了可训练延迟对于高效神经形态硬件部署的关键作用。
English: DelRec introduces the first surrogate gradient learning method to train recurrent spiking neural network delays, achieving state-of-the-art performance on temporal datasets and demonstrating the critical role of trainable delays for efficient neuromorphic deployment.

Authors:Hannah Kim, Kushan Mitra, Chen Shen, Dan Zhang, Estevam Hruschka
Title: AIPOM: Agent-aware Interactive Planning for Multi-Agent Systems
Abstract:
Large language models (LLMs) are being increasingly used for planning in orchestrated multi-agent systems. However, existing LLM-based approaches often fall short of human expectations and, critically, lack effective mechanisms for users to inspect, understand, and control their behaviors. These limitations call for enhanced transparency, controllability, and human oversight. To address this, we introduce AIPOM, a system supporting human-in-the-loop planning through conversational and graph-based interfaces. AIPOM enables users to transparently inspect, refine, and collaboratively guide LLM-generated plans, significantly enhancing user control and trust in multi-agent workflows. Our code and demo video are available at https://github.com/megagonlabs/aipom.
中文: AIPOM系统通过对话和图界面增强多智能体规划中的透明度与用户控制,有效解决了当前LLM方法在可检查性和人工监督方面的不足。
English: The AIPOM system introduces conversational and graph-based interfaces to enhance transparency and user control in LLM-driven multi-agent planning, addressing current limitations in inspectability and human oversight.

Authors:Rui Jia, Yuang Wei, Ruijia Li, Yuang-Hao Jiang, Xinyu Xie, Yaomin Shen, Min Zhang, Bo Jiang
Title: DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework
Abstract:
While cognitive diagnosis (CD) effectively assesses students' knowledge mastery from structured test data, applying it to real-world teacher-student dialogues presents two fundamental challenges. Traditional CD models lack a suitable framework for handling dynamic, unstructured dialogues, and it's difficult to accurately extract diagnostic semantics from lengthy dialogues. To overcome these hurdles, we propose DiaCDM, an innovative model. We've adapted the initiation-response-evaluation (IRE) framework from educational theory to design a diagnostic framework tailored for dialogue. We also developed a unique graph-based encoding method that integrates teacher questions with relevant knowledge components to capture key information more precisely. To our knowledge, this is the first exploration of cognitive diagnosis in a dialogue setting. Experiments on three real-world dialogue datasets confirm that DiaCDM not only significantly improves diagnostic accuracy but also enhances the results' interpretability, providing teachers with a powerful tool for assessing students' cognitive states. The code is available at https://github.com/Mind-Lab-ECNU/DiaCDM/tree/main.
Chinese: 本研究提出DiaCDM模型,通过采用IRE教育框架和基于图的编码方法,在师生对话中实现认知诊断,显著提升了诊断准确性和结果可解释性。
English: The study introduces DiaCDM, a novel model that adapts the IRE framework and employs graph-based encoding to perform cognitive diagnosis in teacher-student dialogues, significantly enhancing accuracy and interpretability.

Authors:Boxuan Zhang, Runqing Wang, Wei Xiao, Weipu Zhang, Jian Sun, Gao Huang, Jie Chen, Gang Wang
Title: DyMoDreamer: World Modeling with Dynamic Modulation
Abstract:
A critical bottleneck in deep reinforcement learning (DRL) is sample inefficiency, as training high-performance agents often demands extensive environmental interactions. Model-based reinforcement learning (MBRL) mitigates this by building world models that simulate environmental dynamics and generate synthetic experience, improving sample efficiency. However, conventional world models process observations holistically, failing to decouple dynamic objects and temporal features from static backgrounds. This approach is computationally inefficient, especially for visual tasks where dynamic objects significantly influence rewards and decision-making performance. To address this, we introduce DyMoDreamer, a novel MBRL algorithm that incorporates a dynamic modulation mechanism to improve the extraction of dynamic features and enrich the temporal information. DyMoDreamer employs differential observations derived from a novel inter-frame differencing mask, explicitly encoding object-level motion cues and temporal dynamics. Dynamic modulation is modeled as stochastic categorical distributions and integrated into a recurrent state-space model (RSSM), enhancing the model's focus on reward-relevant dynamics. Experiments demonstrate that DyMoDreamer sets a new state-of-the-art on the Atari $100$k benchmark with a $156.6$\% mean human-normalized score, establishes a new record of $832$ on the DeepMind Visual Control Suite, and gains a $9.5$\% performance improvement after $1$M steps on the Crafter benchmark. Our code is released at https://github.com/Ultraman-Tiga1/DyMoDreamer.
Chinese: DyMoDreamer 在基于模型的强化学习中引入动态调制机制,以增强动态特征和时间信息的提取,在多个基准测试中实现了最先进的性能。
English: DyMoDreamer introduces a dynamic modulation mechanism in model-based reinforcement learning to enhance the extraction of dynamic features and temporal information, achieving state-of-the-art performance on multiple benchmarks.

Authors:Hongyang Zhang, Yinhao Liu, Zhenyu Kuang
Title: SkyLink: Unifying Street-Satellite Geo-Localization via UAV-Mediated 3D Scene Alignment
Abstract:
Cross-view geo-localization aims at establishing location correspondences between different viewpoints. Existing approaches typically learn cross-view correlations through direct feature similarity matching, often overlooking semantic degradation caused by extreme viewpoint disparities. To address this unique problem, we focus on robust feature retrieval under viewpoint variation and propose the novel SkyLink method. We firstly utilize the Google Retrieval Enhancement Module to perform data enhancement on street images, which mitigates the occlusion of the key target due to restricted street viewpoints. The Patch-Aware Feature Aggregation module is further adopted to emphasize multiple local feature aggregations to ensure the consistent feature extraction across viewpoints. Meanwhile, we integrate the 3D scene information constructed from multi-scale UAV images as a bridge between street and satellite viewpoints, and perform feature alignment through self-supervised and cross-view contrastive learning. Experimental results demonstrate robustness and generalization across diverse urban scenarios, which achieve 25.75$\%$ Recall@1 accuracy on University-1652 in the UAVM2025 Challenge. Code will be released at https://github.com/HRT00/CVGL-3D.
Chinese: SkyLink方法通过数据增强、局部特征聚合和三维场景整合,有效缓解视角差异导致的语义退化,在University-1652数据集上实现了25.75%的Recall@1准确率,提升了跨视角地理定位的鲁棒性。
English: The SkyLink method enhances cross-view geo-localization by mitigating semantic degradation through data enhancement, patch-aware feature aggregation, and 3D scene integration, achieving robust performance with 25.75% Recall@1 accuracy on University-1652.

Authors:Yizhuo Ding, Mingkang Chen, Zhibang Feng, Tong Xiao, Wanying Qu, Wenqi Shao, Yanwei Fu
Title: VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding
Abstract:
Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies-explicit, implicit, visual, and textual-across four multimodal benchmarks and two MLLMs. Our findings show that explicit perception, especially when paired with textual cues, consistently yields the best improvements, particularly for smaller models. Based on this insight, we propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning. Stage 1 introduces perception-augmented fine-tuning, and Stage 2 applies perception-aware reinforcement learning with novel visual, textual, and consistency rewards. Experiments demonstrate that VTPerception-R1 significantly improves reasoning accuracy and robustness across diverse tasks, offering a scalable and auditable solution for perception-grounded multimodal reasoning. Our code is available at: https://github.com/yizhuoDi/VTPerceprion-R1.
中文: 多模态大语言模型常难以将推理基于感知证据,而提出的VTPerception-R1框架通过感知与推理解耦的两阶段微调和强化学习,显著提升了多种任务中的准确性和鲁棒性。
English: Multimodal large language models often fail to base reasoning on perceptual evidence, but the proposed VTPerception-R1 framework, which decouples perception from reasoning through two stages of fine-tuning and reinforcement learning, significantly enhances accuracy and robustness across various tasks.

Authors:Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun
Title: ExGS: Extreme 3D Gaussian Compression with Diffusion Priors
Abstract:
Neural scene representations, such as 3D Gaussian Splatting (3DGS), have enabled high-quality neural rendering; however, their large storage and transmission costs hinder deployment in resource-constrained environments. Existing compression methods either rely on costly optimization, which is slow and scene-specific, or adopt training-free pruning and quantization, which degrade rendering quality under high compression ratios. In contrast, recent data-driven approaches provide a promising direction to overcome this trade-off, enabling efficient compression while preserving high rendering quality. We introduce ExGS, a novel feed-forward framework that unifies Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS compression. UGC performs re-optimization-free pruning to aggressively reduce Gaussian primitives while retaining only essential information, whereas GaussPainter leverages powerful diffusion priors with mask-guided refinement to restore high-quality renderings from heavily pruned Gaussian scenes. Unlike conventional inpainting, GaussPainter not only fills in missing regions but also enhances visible pixels, yielding substantial improvements in degraded renderings. To ensure practicality, it adopts a lightweight VAE and a one-step diffusion design, enabling real-time restoration. Our framework can even achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31 MB) while preserving fidelity and significantly improving image quality under challenging conditions. These results highlight the central role of diffusion priors in bridging the gap between extreme compression and high-quality neural rendering. Our code repository will be released at: https://github.com/chenttt2001/ExGS
中文: ExGS框架通过结合无重优化的通用高斯压缩与基于扩散先验的高斯绘制器,实现了3D高斯溅射模型的极致压缩,在超过100倍压缩比下仍能保持高质量渲染效果。
English: The ExGS framework achieves extreme compression of 3D Gaussian Splatting models by combining Universal Gaussian Compression for efficient pruning and GaussPainter with diffusion priors for quality restoration, enabling over 100x size reduction while maintaining high rendering fidelity.

Authors:Longxiang He, Deheng Ye, Junbo Tan, Xueqian Wang, Li Shen
Title: Robust Policy Expansion for Offline-to-Online RL under Diverse Data Corruption
Abstract:
Pretraining a policy on offline data followed by fine-tuning through online interactions, known as Offline-to-Online Reinforcement Learning (O2O RL), has emerged as a promising paradigm for real-world RL deployment. However, both offline datasets and online interactions in practical environments are often noisy or even maliciously corrupted, severely degrading the performance of O2O RL. Existing works primarily focus on mitigating the conservatism of offline policies via online exploration, while the robustness of O2O RL under data corruption, including states, actions, rewards, and dynamics, is still unexplored. In this work, we observe that data corruption induces heavy-tailed behavior in the policy, thereby substantially degrading the efficiency of online exploration. To address this issue, we incorporate Inverse Probability Weighted (IPW) into the online exploration policy to alleviate heavy-tailedness, and propose a novel, simple yet effective method termed $\textbf{RPEX}$: $\textbf{R}$obust $\textbf{P}$olicy $\textbf{EX}$pansion. Extensive experimental results on D4RL datasets demonstrate that RPEX achieves SOTA O2O performance across a wide range of data corruption scenarios. Code is available at $\href{https://github.com/felix-thu/RPEX}{https://github.com/felix-thu/RPEX}$.
中文: 离线到在线强化学习因数据污染导致性能下降,提出的RPEX方法采用逆概率加权技术增强鲁棒性,在多种数据污染场景下取得了最优性能。
English: Offline-to-Online Reinforcement Learning faces performance degradation from data corruption, which is addressed by the proposed RPEX method using Inverse Probability Weighting to enhance robustness and achieve state-of-the-art results.

Authors:Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che
Title: ProxyAttn: Guided Sparse Attention via Representative Heads
Abstract:
The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.
中文: ProxyAttn是一种无需训练的稀疏注意力算法,通过压缩注意力头维度并利用代表性代理头进行细粒度块重要性评估,在长文本任务中实现了高达10.3倍的注意力加速且无明显性能损失。
English: ProxyAttn is a training-free sparse attention algorithm that enhances efficiency in long-text tasks by compressing attention head dimensions and using representative proxy heads for fine-grained block importance estimation, achieving up to 10.3x attention acceleration without significant performance loss.

Authors:Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
Title: IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
Abstract:
The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available. Code is available at https://github.com/L-O-I/IWR-Bench.
中文: 本文提出了IWR-Bench这一新颖基准,用于评估大视觉语言模型从视频重建交互式网页的能力,通过对28个模型的广泛实验发现,尽管视觉保真度尚可,但功能正确性方面仍存在显著挑战。
English: This paper introduces IWR-Bench, a novel benchmark for evaluating Large Vision-Language Models' ability to reconstruct interactive webpages from videos, revealing significant challenges in functional correctness despite moderate visual fidelity through extensive experiments on 28 models.

Authors:Suli Wang, Yang-yang Li, Siqi Cai, Haizhou Li
Title: A Robust Multi-Scale Framework with Test-Time Adaptation for sEEG-Based Speech Decoding
Abstract:
Decoding speech from stereo-electroencephalography (sEEG) signals has emerged as a promising direction for brain-computer interfaces (BCIs). Its clinical applicability, however, is limited by the inherent non-stationarity of neural signals, which causes domain shifts between training and testing, undermining decoding reliability. To address this challenge, a two-stage framework is proposed for enhanced robustness. First, a multi-scale decomposable mixing (MDM) module is introduced to model the hierarchical temporal dynamics of speech production, learning stable multi-timescale representations from sEEG signals. Second, a source-free online test-time adaptation (TTA) method performs entropy minimization to adapt the model to distribution shifts during inference. Evaluations on the public DU-IN spoken word decoding benchmark show that the approach outperforms state-of-the-art models, particularly in challenging cases. This study demonstrates that combining invariant feature learning with online adaptation is a principled strategy for developing reliable BCI systems. Our code is available at https://github.com/lyyi599/MDM-TENT.
中文: 本研究提出一个结合多尺度特征学习和在线适应的两阶段框架,旨在提高从立体脑电图信号解码语音的鲁棒性,并在基准测试中展现出优越性能。
English: This study introduces a two-stage framework combining multi-scale feature learning with online adaptation to enhance the robustness of speech decoding from sEEG signals, demonstrating superior performance on benchmark tests.

Authors:Daniel Pahr, Sara Di Bartolomeo
Title: Investigating the Task Load of Investigating the Task Load in Visualization Studies
Abstract:
The NASA task load index (short: NASA-TLX) is a common metric to evaluate the workload of a user in a visualization study. Yet, it is rarely performed as initially intended, as the sources-of-workload evaluation is often omitted for various reasons. We conduct an online survey to investigate the task load of administering different versions of the NASA-TLX in a meta-study using the ReVISit framework. Our results show that it is not the slight increase in experiment time, but rather participants' frustration with the procedure, that contributes to the slight increase in task load when using the full version of the TLX compared to using a shortened version. However, we also show that the full version can shine a different and more faceted light on workload by adding a personal dimension to the data. We propose that a compact version of the sources-of-workload questionnaire can mitigate both time loss and frustration for study participants, while still providing the same data as the original procedure. The online study can be found and interactively explored on https://dpahr.github.io/tlxtlx/, and the source for the study, as well as the code for our analysis, can be found on https://github.com/dpahr/tlxtlx/.
中文: 研究表明,完整版NASA-TLX虽因参与者挫败感略微增加任务负荷,但能提供更全面的工作负荷视角,建议采用精简版在保持数据质量的同时缓解这些问题。
English: The study reveals that while the full NASA-TLX version slightly increases task load due to participant frustration rather than time, it offers richer workload insights, and a compact version is proposed to reduce these issues while maintaining data quality.

Authors:Gio Paik, Yongbeom Kim, Soungmin Lee, Sangmin Ahn, Chanwoo Kim
Title: HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition
Abstract:
Despite advances in multilingual automatic speech recognition (ASR), code-switching (CS), the mixing of languages within an utterance common in daily speech, remains a severely underexplored challenge. In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. The proposed framework not only consists of high-quality, natural CS data across various topics, but also provides meticulous loanword labels and a hierarchical CS-level labeling scheme (word, phrase, and sentence) that together enable a systematic evaluation of a model's ability to handle each distinct level of code-switching. Through evaluations of diverse multilingual ASR models and fine-tuning experiments, this paper demonstrates that although most multilingual ASR models initially exhibit inadequate CS-ASR performance, this capability can be enabled through fine-tuning with synthetic CS data. HiKE is available at https://github.com/ThetaOne-AI/HiKE
中文: HiKE是首个全球可访问的韩英语码转换基准,通过提供分层标签和自然数据,系统评估多语言ASR模型性能,并证明利用合成数据微调可有效提升其语码转换处理能力。
English: HiKE is the first globally accessible Korean-English code-switching benchmark that provides hierarchical labels and natural data to systematically evaluate and improve multilingual ASR models' performance through fine-tuning with synthetic data.

Authors:Josip Tomo Licardo, Nikola Tankovic, Darko Etinger
Title: BPMN Assistant: An LLM-Based Approach to Business Process Modeling
Abstract:
This paper presents BPMN Assistant, a tool that leverages Large Language Models (LLMs) for natural language-based creation and editing of BPMN diagrams. A specialized JSON-based representation is introduced as a structured alternative to the direct handling of XML to enhance the accuracy of process modifications. Process generation quality is evaluated using Graph Edit Distance (GED) and Relative Graph Edit Distance (RGED), while editing performance is evaluated with a binary success metric. Results show that JSON and XML achieve similar similarity scores in generation, but JSON offers greater reliability, faster processing, and significantly higher editing success rates. We discuss key trade-offs, limitations, and future improvements. The implementation is available at https://github.com/jtlicardo/bpmn-assistant.
中文: 本文介绍了BPMN Assistant工具,它利用大型语言模型通过自然语言创建和编辑BPMN图,采用JSON格式相比XML提高了准确性和效率,图编辑距离指标和更高的编辑成功率验证了其优势。
English: This paper introduces BPMN Assistant, a tool using LLMs to create and edit BPMN diagrams through natural language, employing a JSON format that improves accuracy and efficiency over XML, as validated by graph distance metrics and higher editing success rates.

Authors:Zidu Wang, Meng Xu, Miao Xu, Hengyuan Ma, Jiankuo Zhao, Xutao Li, Xiangyu Zhu, Zhen Lei
Title: BFSM: 3D Bidirectional Face-Skull Morphable Model
Abstract:
Building a joint face-skull morphable model holds great potential for applications such as remote diagnostics, surgical planning, medical education, and physically based facial simulation. However, realizing this vision is constrained by the scarcity of paired face-skull data, insufficient registration accuracy, and limited exploration of reconstruction and clinical applications. Moreover, individuals with craniofacial deformities are often overlooked, resulting in underrepresentation and limited inclusivity. To address these challenges, we first construct a dataset comprising over 200 samples, including both normal cases and rare craniofacial conditions. Each case contains a CT-based skull, a CT-based face, and a high-fidelity textured face scan. Secondly, we propose a novel dense ray matching registration method that ensures topological consistency across face, skull, and their tissue correspondences. Based on this, we introduce the 3D Bidirectional Face-Skull Morphable Model (BFSM), which enables shape inference between the face and skull through a shared coefficient space, while also modeling tissue thickness variation to support one-to-many facial reconstructions from the same skull, reflecting individual changes such as fat over time. Finally, we demonstrate the potential of BFSM in medical applications, including 3D face-skull reconstruction from a single image and surgical planning prediction. Extensive experiments confirm the robustness and accuracy of our method. BFSM is available at https://github.com/wang-zidu/BFSM
中文: 本研究提出了三维双向人脸-颅骨可变形模型(BFSM),通过共享系数空间实现人脸与颅骨间的形状推断,解决了数据稀缺和配准精度问题,并展示了在医学重建和手术规划中的应用潜力。
English: The study introduces a 3D Bidirectional Face-Skull Morphable Model (BFSM) that enables shape inference between faces and skulls using a shared coefficient space, addressing data scarcity and registration challenges while demonstrating applications in medical reconstruction and surgical planning.

Authors:Peter Hönig, Stefan Thalhammer, Jean-Baptiste Weibel, Matthias Hirschmanner, Markus Vincze
Title: SCOPE: Semantic Conditioning for Sim2Real Category-Level Object Pose Estimation in Robotics
Abstract:
Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects, which requires semantic understanding in order to generalize both to known categories and beyond. To resolve this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9\% on the 5$^\circ$5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100\%. Code available: https://github.com/hoenigpeter/scope.
中文摘要:SCOPE是一种基于扩散的模型,利用DINOv2特征作为连续语义先验,无需离散类别标签即可实现精确的类别级物体姿态估计,在达到最先进性能的同时能够泛化至未知类别物体。
English Summary: SCOPE is a diffusion-based model that uses DINOv2 features as continuous semantic priors to enable accurate category-level object pose estimation without discrete labels, achieving state-of-the-art performance and generalization to unseen object categories.

Authors:Sophia N. Wilson, Jens Hesselbjerg Christensen, Raghavendra Selvan
Title: Trading Carbon for Physics: On the Resource Efficiency of Machine Learning for Spatio-Temporal Forecasting
Abstract:
Development of modern deep learning methods has been driven primarily by the push for improving model efficacy (accuracy metrics). This sole focus on efficacy has steered development of large-scale models that require massive resources, and results in considerable carbon footprint across the model life-cycle. In this work, we explore how physics inductive biases can offer useful trade-offs between model efficacy and model efficiency (compute, energy, and carbon). We study a variety of models for spatio-temporal forecasting, a task governed by physical laws and well-suited for exploring different levels of physics inductive bias. We show that embedding physics inductive biases into the model design can yield substantial efficiency gains while retaining or even improving efficacy for the tasks under consideration. In addition to using standard physics-informed spatio-temporal models, we demonstrate the usefulness of more recent models like flow matching as a general purpose method for spatio-temporal forecasting. Our experiments show that incorporating physics inductive biases offer a principled way to improve the efficiency and reduce the carbon footprint of machine learning models. We argue that model efficiency, along with model efficacy, should become a core consideration driving machine learning model development and deployment.
中文: 现代深度学习过度追求模型效能导致资源消耗大、碳足迹高,而引入物理归纳偏置能在保持甚至提升性能的同时显著提高效率,主张将效率作为模型开发与部署的核心考量。
English: Modern deep learning's focus on efficacy has led to resource-intensive models with high carbon footprints, but incorporating physics inductive biases can enhance efficiency while maintaining or improving performance, advocating for efficiency as a core development criterion.

Authors:Wenjie Fu, Huandong Wang, Junyao Gao, Guoan Wan, Tao Jiang
Title: Sanitize Your Responses: Mitigating Privacy Leakage in Large Language Models
Abstract:
As Large Language Models (LLMs) achieve remarkable success across a wide range of applications, such as chatbots and code copilots, concerns surrounding the generation of harmful content have come increasingly into focus. Despite significant advances in aligning LLMs with safety and ethical standards, adversarial prompts can still be crafted to elicit undesirable responses. Existing mitigation strategies are predominantly based on post-hoc filtering, which introduces substantial latency or computational overhead, and is incompatible with token-level streaming generation. In this work, we introduce Self-Sanitize, a novel LLM-driven mitigation framework inspired by cognitive psychology, which emulates human self-monitor and self-repair behaviors during conversations. Self-Sanitize comprises a lightweight Self-Monitor module that continuously inspects high-level intentions within the LLM at the token level via representation engineering, and a Self-Repair module that performs in-place correction of harmful content without initiating separate review dialogues. This design allows for real-time streaming monitoring and seamless repair, with negligible impact on latency and resource utilization. Given that privacy-invasive content has often been insufficiently focused in previous studies, we perform extensive experiments on four LLMs across three privacy leakage scenarios. The results demonstrate that Self-Sanitize achieves superior mitigation performance with minimal overhead and without degrading the utility of LLMs, offering a practical and robust solution for safer LLM deployments. Our code is available at the following link: https://github.com/wjfu99/LLM_Self_Sanitize
中文: 本文提出受认知心理学启发的Self-Sanitize框架,通过自监控和自修复模块对大型语言模型进行实时有害内容检测与修正,在保证模型效用的同时以最小开销实现卓越的安全防护效果。
English: This paper introduces Self-Sanitize, a lightweight framework inspired by cognitive psychology that enables real-time monitoring and correction of harmful content in LLMs through self-monitoring and self-repair modules, achieving effective mitigation with minimal latency and resource impact.

Authors:Haosi Mo, Xinyu Ma, Xuebo Liu, Derek F. Wong, Yu Li, Jie Liu, Min Zhang
Title: CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task
Abstract:
Recent advances in Large Language Models (LLMs) have significantly enhanced their capabilities, highlighting the need for comprehensive evaluation frameworks that extend beyond task-specific benchmarks. However, existing benchmarks often focus on isolated abilities, lacking a holistic framework for assessing LLM capabilities. To address this gap, we propose the Cognition-Domain-Task (CDT) framework, which comprehensively measures a model's capabilities across three dimensions. We expand the scope of model capability definitions at the cognitive level by incorporating the Cattell-Horn-Carroll cognitive theory, refining the categorization of model capabilities. We apply CDT in two directions: dataset capability evaluation and data selection. Experiments show that our capability metrics correlate well with downstream performance and can support effective dataset analysis and construction. The experiments on data selection also show significant improvements in both general and specific benchmarks, achieving scores of 44.3 and 45.4, with an increase of 1.6 and 2.2 points over the baselines, respectively. These results validate the effectiveness and practicality of CDT. Source code and models are available at https://github.com/Alessa-mo/CDT.
中文摘要:本文提出的CDT框架通过认知-领域-任务三维度全面评估大语言模型能力,实验证明该框架能有效提升基准测试表现并支持数据选择等实际应用。
English Summary: The CDT framework is introduced to holistically evaluate Large Language Models across cognitive, domain, and task dimensions, demonstrating improved performance in benchmarks and practical applications like data selection.

Authors:Kai Liu, Shaoqiu Zhang, Linghe Kong, Yulun Zhang
Title: CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers
Abstract:
Visual generation quality has been greatly promoted with the rapid advances in diffusion transformers (DiTs), which is attributed to the scaling of model size and complexity. However, these attributions also hinder the practical deployment of DiTs on edge devices, limiting their development and application. Serve as an efficient model compression technique, model post-training quantization (PTQ) can reduce the memory consumption and speed up the inference, with inevitable performance degradation. To alleviate the degradation, we propose CLQ, a cross-layer guided orthogonal-based quantization method for DiTs. To be specific, CLQ consists of three key designs. First, we observe that the calibration data used by most of the PTQ methods can not honestly represent the distribution of the activations. Therefore, we propose cross-block calibration (CBC) to obtain accurate calibration data, with which the quantization can be better guided. Second, we propose orthogonal-based smoothing (OBS), which quantifies the outlier score of each channel and leverages block Hadamard matrix to smooth the outliers with negligible overhead. Third, we propose cross-layer parameter searching (CLPS) to search. We evaluate CLQ with both image generation and video generation models and successfully compress the model into W4A4 with negligible degradation in visual quality and metrics. CLQ achieves 3.98x memory saving and 3.95x speedup. Our code is available at \hyperlink{https://github.com/Kai-Liu001/CLQ}{https://github.com/Kai-Liu001/CLQ}.
中文摘要:CLQ是一种针对扩散变换器的跨层引导正交量化方法,通过精确校准数据和优化量化过程,在保持视觉质量的同时实现高效模型压缩,大幅降低内存占用并提升推理速度。
English Summary: CLQ is a novel cross-layer guided orthogonal-based quantization method for diffusion transformers that achieves efficient model compression with minimal performance degradation, enabling significant memory savings and inference speedup.

Authors:Tao Yin, Xiaohong Zhang, Shaochen Fu, Zhibin Zhang, Li Huang, Yiyuan Yang, Kaixiang Yang, Meng Yan
Title: ScatterAD: Temporal-Topological Scattering Mechanism for Time Series Anomaly Detection
Abstract:
One main challenge in time series anomaly detection for industrial IoT lies in the complex spatio-temporal couplings within multivariate data. However, traditional anomaly detection methods focus on modeling spatial or temporal dependencies independently, resulting in suboptimal representation learning and limited sensitivity to anomalous dispersion in high-dimensional spaces. In this work, we conduct an empirical analysis showing that both normal and anomalous samples tend to scatter in high-dimensional space, especially anomalous samples are markedly more dispersed. We formalize this dispersion phenomenon as scattering, quantified by the mean pairwise distance among sample representations, and leverage it as an inductive signal to enhance spatio-temporal anomaly detection. Technically, we propose ScatterAD to model representation scattering across temporal and topological dimensions. ScatterAD incorporates a topological encoder for capturing graph-structured scattering and a temporal encoder for constraining over-scattering through mean squared error minimization between neighboring time steps. We introduce a contrastive fusion mechanism to ensure the complementarity of the learned temporal and topological representations. Additionally, we theoretically show that maximizing the conditional mutual information between temporal and topological views improves cross-view consistency and enhances more discriminative representations. Extensive experiments on multiple public benchmarks show that ScatterAD achieves state-of-the-art performance on multivariate time series anomaly detection. Code is available at this repository: https://github.com/jk-sounds/ScatterAD.
中文: 工业物联网时序异常检测面临复杂时空耦合的挑战,ScatterAD通过将异常分散形式化为散射现象,并利用对比融合机制结合时空与拓扑表征学习,有效提升了检测性能。
English: Industrial IoT time series anomaly detection faces challenges in modeling complex spatio-temporal couplings, which ScatterAD addresses by formalizing anomalous dispersion as scattering and enhancing detection through temporal and topological representation learning with contrastive fusion.

Authors:Khanh Trinh Pham, Thu Huong Nguyen, Jun Jo, Quoc Viet Hung Nguyen, Thanh Tam Nguyen
Title: Multilingual Text-to-SQL: Benchmarking the Limits of Language Models with Collaborative Language Agents
Abstract:
Text-to-SQL enables natural access to databases, yet most benchmarks are English-only, limiting multilingual progress. We introduce MultiSpider 2.0, extending Spider 2.0 to eight languages (English, German, French, Spanish, Portuguese, Japanese, Chinese, Vietnamese). It preserves Spider 2.0's structural difficulty while adding linguistic and dialectal variability, demanding deeper reasoning for complex SQL. On this benchmark, state-of-the-art LLMs (such as DeepSeek-R1 and OpenAI o1) reach only 4\% execution accuracy when relying on intrinsic reasoning, versus 60\% on MultiSpider 1.0. Therefore, we provide a collaboration-driven language agents baseline that iteratively refines queries, improving accuracy to 15\%. These results reveal a substantial multilingual gap and motivate methods that are robust across languages and ready for real-world enterprise deployment. Our benchmark is available at https://github.com/phkhanhtrinh23/Multilingual_Text_to_SQL.
中文:MultiSpider 2.0将Spider 2.0扩展至八种语言,揭示了大型语言模型在多语言环境下执行准确率仅为4%的显著差距,并通过协作式语言代理基准将准确率提升至15%。
English: MultiSpider 2.0 extends Spider 2.0 to eight languages, revealing a significant multilingual gap where state-of-the-art LLMs achieve only 4% execution accuracy, and proposes a collaborative language agent baseline that improves accuracy to 15%.

Authors:Song-Ze Yu
Title: From Sound to Setting: AI-Based Equalizer Parameter Prediction for Piano Tone Replication
Abstract:
This project presents an AI-based system for tone replication in music production, focusing on predicting EQ parameter settings directly from audio features. Unlike traditional audio-to-audio methods, our approach outputs interpretable parameter values (e.g., EQ band gains) that musicians can further adjust in their workflow. Using a dataset of piano recordings with systematically varied EQ settings, we evaluate both regression and neural network models. The neural network achieves a mean squared error of 0.0216 on multi-band tasks. The system enables practical, flexible, and automated tone matching for music producers and lays the foundation for extensions to more complex audio effects.
中文: 该项目开发了一种基于人工智能的系统,通过音频特征直接预测均衡器参数以实现音乐制作的自动音色匹配,神经网络模型在多频段任务中表现出色,为音乐制作人提供了可灵活调整的实用解决方案。
English: This project introduces an AI system that predicts EQ parameters from audio features for automated tone matching in music production, achieving high accuracy with a neural network model and offering adjustable, interpretable outputs for practical use.

Authors:Xin Ding, Jianyu Wei, Yifan Yang, Shiqi Jiang, Qianxi Zhang, Hao Wu, Fucheng Jia, Liang Mi, Yuxuan Yan, Weijun Wang, Yunxin Liu, Zhibo Chen, Ting Cao
Title: AdaNav: Adaptive Reasoning with Uncertainty for Vision-Language Navigation
Abstract:
Vision Language Navigation (VLN) requires agents to follow natural language instructions by grounding them in sequential visual observations over long horizons. Explicit reasoning could enhance temporal consistency and perception action alignment, but reasoning at fixed steps often leads to suboptimal performance and unnecessary computation. To address this, we propose AdaNav, an uncertainty-based adaptive reasoning framework for VLN. At its core is the Uncertainty Adaptive Reasoning Block (UAR), a lightweight plugin that dynamically triggers reasoning. We introduce Action Entropy as a policy prior for UAR and progressively refine it through a Heuristics to RL training method, enabling agents to learn difficulty aware reasoning policies under the strict data limitations of embodied tasks. Results show that with only 6K training samples, AdaNav achieves substantial gains over closed source models trained on million scale data, improving success rate by 20% on R2R val-unseen, 11.7% on RxR-CE, and 11.4% in real world scenes. The code is available at https://github.com/xinding-sys/AdaNav.
中文: AdaNav提出了一种基于不确定性的自适应推理框架,通过动态触发轻量级推理模块,在少量训练数据下显著提升了视觉语言导航任务的性能表现。
English: AdaNav introduces an uncertainty-based adaptive reasoning framework that dynamically triggers lightweight reasoning blocks, achieving significant performance improvements in Vision Language Navigation with minimal training data.

Authors:Junyi Gu, Beatriz Cabrero-Daniel, Ali Nouri, Lydia Armini, Christian Berger
Title: PCICF: A Pedestrian Crossing Identification and Classification Framework
Abstract:
We have recently observed the commercial roll-out of robotaxis in various countries. They are deployed within an operational design domain (ODD) on specific routes and environmental conditions, and are subject to continuous monitoring to regain control in safety-critical situations. Since ODDs typically cover urban areas, robotaxis must reliably detect vulnerable road users (VRUs) such as pedestrians, bicyclists, or e-scooter riders. To better handle such varied traffic situations, end-to-end AI, which directly compute vehicle control actions from multi-modal sensor data instead of only for perception, is on the rise. High quality data is needed for systematically training and evaluating such systems within their OOD. In this work, we propose PCICF, a framework to systematically identify and classify VRU situations to support ODD's incident analysis. We base our work on the existing synthetic dataset SMIRK, and enhance it by extending its single-pedestrian-only design into the MoreSMIRK dataset, a structured dictionary of multi-pedestrian crossing situations constructed systematically. We then use space-filling curves (SFCs) to transform multi-dimensional features of scenarios into characteristic patterns, which we match with corresponding entries in MoreSMIRK. We evaluate PCICF with the large real-world dataset PIE, which contains more than 150 manually annotated pedestrian crossing videos. We show that PCICF can successfully identify and classify complex pedestrian crossings, even when groups of pedestrians merge or split. By leveraging computationally efficient components like SFCs, PCICF has even potential to be used onboard of robotaxis for OOD detection for example. We share an open-source replication package for PCICF containing its algorithms, the complete MoreSMIRK dataset and dictionary, as well as our experiment results presented in: https://github.com/Claud1234/PCICF
中文摘要:PCICF框架通过增强的MoreSMIRK数据集和空间填充曲线技术,能系统识别和分类自动驾驶出租车运行中的弱势道路使用者场景,有效解析行人群体分流合并等复杂行为,提升运营安全分析能力。
English Summary: The PCICF framework is introduced to systematically identify and classify vulnerable road user situations for robotaxi operational safety, leveraging the enhanced MoreSMIRK dataset and space-filling curves to effectively analyze complex pedestrian interactions.

Authors:Shihao Qi, Jie Ma, Ziang Yin, Lingling Zhang, Jian Zhang, Jun Liu, Feng Tian, Tongliang Liu
Title: Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs
Abstract:
Existing methods usually leverage a fixed strategy, such as natural language reasoning, code-augmented reasoning, tool-integrated reasoning, or ensemble-based reasoning, to guide Large Language Models (LLMs) to perform mathematical reasoning. Our analysis reveals that the single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency. To address these issues, we propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution. Specifically, we first curate a multi-strategy preference dataset, which we call MathStrat, capturing correctness, process quality, and computational efficiency for each problem--strategy pair. Then, we train a lightweight Strategy Adapter based on the dataset to obtain confidence distributions over the mentioned four reasoning strategies. At inference time, an adaptive routing policy dynamically tailors the reasoning approach based on predictor confidence. It directs the model to use single-strategy execution for high-confidence predictions, dual-strategy verification for competitive scenarios, or comprehensive multi-strategy exploration for uncertain cases. Extensive experiments across five mathematical reasoning benchmarks demonstrate that PRISM consistently outperforms individual strategies and ensemble baselines, achieving improvements ranging from 0.9% to 7.6% across different base models. The adaptive routing approach shows particularly strong benefits for mathematical reasoning tasks across diverse model architectures. Our code is released at https://github.com/reml-group/PRISM.
中文: 提出的PRISM框架通过策略规划与定向执行的两阶段过程,自适应地为大语言模型选择最适合的数学推理策略,在多个基准测试中均优于固定策略方法。
English: The proposed PRISM framework enhances mathematical reasoning in LLMs by adaptively selecting the most suitable strategy through a two-stage process of planning and execution, outperforming fixed-strategy approaches across multiple benchmarks.

Authors:Hao Chen, Fang Xu, Tamer Saleh, Weifeng Hao, Gui-Song Xia
Title: Mask Clustering-based Annotation Engine for Large-Scale Submeter Land Cover Mapping
Abstract:
Recent advances in remote sensing technology have made submeter resolution imagery increasingly accessible, offering remarkable detail for fine-grained land cover analysis. However, its full potential remains underutilized - particularly for large-scale land cover mapping - due to the lack of sufficient, high-quality annotated datasets. Existing labels are typically derived from pre-existing products or manual annotation, which are often unreliable or prohibitively expensive, particularly given the rich visual detail and massive data volumes of submeter imagery. Inspired by the spatial autocorrelation principle, which suggests that objects of the same class tend to co-occur with similar visual features in local neighborhoods, we propose the Mask Clustering-based Annotation Engine (MCAE), which treats semantically consistent mask groups as the minimal annotating units to enable efficient, simultaneous annotation of multiple instances. It significantly improves annotation efficiency by one to two orders of magnitude, while preserving label quality, semantic diversity, and spatial representativeness. With MCAE, we build a high-quality annotated dataset of about 14 billion labeled pixels, referred to as HiCity-LC, which supports the generation of city-scale land cover maps across five major Chinese cities with classification accuracies above 85%. It is the first publicly available submeter resolution city-level land cover benchmark, highlighting the scalability and practical utility of MCAE for large-scale, submeter resolution mapping. The dataset is available at https://github.com/chenhaocs/MCAE
Chinese: 基于掩码聚类的标注引擎(MCAE)利用空间自相关性实现了亚米级影像的高效大规模标注,构建了包含140亿标记像素的HiCity-LC数据集,在城市尺度土地覆盖制图中达到85%以上的分类精度。
English: The Mask Clustering-based Annotation Engine (MCAE) leverages spatial autocorrelation to enable efficient large-scale annotation of submeter imagery, producing the HiCity-LC dataset with 14 billion labeled pixels and over 85% classification accuracy for city-scale land cover mapping.

Authors:Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen
Title: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
Abstract:
Fine-tuning pre-trained large language models (LLMs) for down-stream tasks is a critical step in the AI deployment pipeline. Reinforcement learning (RL) is arguably the most prominent fine-tuning method, contributing to the birth of many state-of-the-art LLMs. In contrast, evolution strategies (ES), which once showed comparable performance to RL on models with a few million parameters, was neglected due to the pessimistic perception of its scalability to larger models. In this work, we report the first successful attempt to scale up ES for fine-tuning the full parameters of LLMs, showing the surprising fact that ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects, including sample efficiency, tolerance to long-horizon rewards, robustness to different base LLMs, less tendency to reward hacking, and more stable performance across runs. It therefore serves as a basis to unlock a new direction in LLM fine-tuning beyond what current RL techniques provide. The source codes are provided at: https://github.com/VsonicV/es-fine-tuning-paper.
中文: 本研究首次成功将进化策略扩展用于大语言模型的全参数微调,证明其在样本效率、奖励稳定性及抗干扰能力等方面优于主流强化学习方法。
English: This study successfully scales evolution strategies (ES) to fine-tune large language models, demonstrating that ES outperforms reinforcement learning in efficiency, robustness, and stability across multiple metrics.

Authors:Congjia Chen, Yufu Qu
Title: DINOReg: Strong Point Cloud Registration with Vision Foundation Model
Abstract:
Point cloud registration is a fundamental task in 3D computer vision. Most existing methods rely solely on geometric information for feature extraction and matching. Recently, several studies have incorporated color information from RGB-D data into feature extraction. Although these methods achieve remarkable improvements, they have not fully exploited the abundant texture and semantic information in images, and the feature fusion is performed in an image-lossy manner, which limit their performance. In this paper, we propose DINOReg, a registration network that sufficiently utilizes both visual and geometric information to solve the point cloud registration problem. Inspired by advances in vision foundation models, we employ DINOv2 to extract informative visual features from images, and fuse visual and geometric features at the patch level. This design effectively combines the rich texture and global semantic information extracted by DINOv2 with the detailed geometric structure information captured by the geometric backbone. Additionally, a mixed positional embedding is proposed to encode positional information from both image space and point cloud space, which enhances the model's ability to perceive spatial relationships between patches. Extensive experiments on the RGBD-3DMatch and RGBD-3DLoMatch datasets demonstrate that our method achieves significant improvements over state-of-the-art geometry-only and multi-modal registration methods, with a 14.2% increase in patch inlier ratio and a 15.7% increase in registration recall. The code is publicly available at https://github.com/ccjccjccj/DINOReg.
中文: DINOReg是一种新颖的点云配准网络,通过结合DINOv2提取的视觉特征与几何特征进行补丁级融合,并采用混合位置编码,在多个数据集上实现了显著优于现有方法的性能提升。
English: DINOReg is a novel point cloud registration network that effectively integrates visual features from DINOv2 with geometric data through patch-level fusion and mixed positional embedding, achieving significant performance improvements over existing methods.

Authors:Jitai Hao, Hao Liu, Xinyan Xiao, Qiang Huang, Jun Yu
Title: Uni-X: Mitigating Modality Conflict with a Two-End-Separated Architecture for Unified Multimodal Models
Abstract:
Unified Multimodal Models (UMMs) built on shared autoregressive (AR) transformers are attractive for their architectural simplicity. However, we identify a critical limitation: when trained on multimodal inputs, modality-shared transformers suffer from severe gradient conflicts between vision and text, particularly in shallow and deep layers. We trace this issue to the fundamentally different low-level statistical properties of images and text, while noting that conflicts diminish in middle layers where representations become more abstract and semantically aligned. To overcome this challenge, we propose Uni-X, a two-end-separated, middle-shared architecture. Uni-X dedicates its initial and final layers to modality-specific processing, while maintaining shared parameters in the middle layers for high-level semantic fusion. This X-shaped design not only eliminates gradient conflicts at both ends but also further alleviates residual conflicts in the shared layers. Extensive experiments validate the effectiveness of Uni-X. Under identical training conditions, Uni-X achieves superior training efficiency compared to strong baselines. When scaled to 3B parameters with larger training data, Uni-X matches or surpasses 7B AR-based UMMs, achieving a GenEval score of 82 for image generation alongside strong performance in text and vision understanding tasks. These results establish Uni-X as a parameter-efficient and scalable foundation for future unified multimodal modeling. Our code is available at https://github.com/CURRENTF/Uni-X
中文:Uni-X架构通过两端分离模态处理、中间共享参数的方式解决了多模态模型中的梯度冲突问题,以更少的参数实现了更高的效率和性能。
English: The Uni-X architecture addresses gradient conflicts in multimodal models by separating modality-specific processing at the ends while sharing middle layers, achieving superior efficiency and performance with fewer parameters.

Authors:Hao Yang, Weijie Qiu, Ru Zhang, Zhou Fang, Ruichao Mao, Xiaoyu Lin, Maji Huang, Zhaosong Huang, Teng Guo, Shuoyang Liu, Hai Rao
Title: UI-UG: A Unified MLLM for UI Understanding and Generation
Abstract:
Although Multimodal Large Language Models (MLLMs) have been widely applied across domains, they are still facing challenges in domain-specific tasks, such as User Interface (UI) understanding accuracy and UI generation quality. In this paper, we introduce UI-UG (a unified MLLM for UI Understanding and Generation), integrating both capabilities. For understanding tasks, we employ Supervised Fine-tuning (SFT) combined with Group Relative Policy Optimization (GRPO) to enhance fine-grained understanding on the modern complex UI data. For generation tasks, we further use Direct Preference Optimization (DPO) to make our model generate human-preferred UIs. In addition, we propose an industrially effective workflow, including the design of an LLM-friendly domain-specific language (DSL), training strategies, rendering processes, and evaluation metrics. In experiments, our model achieves state-of-the-art (SOTA) performance on understanding tasks, outperforming both larger general-purpose MLLMs and similarly-sized UI-specialized models. Our model is also on par with these larger MLLMs in UI generation performance at a fraction of the computational cost. We also demonstrate that integrating understanding and generation tasks can improve accuracy and quality for both tasks. Code and Model: https://github.com/neovateai/UI-UG
中文: 本文提出UI-UG这一统一多模态大语言模型,整合了用户界面理解与生成能力,在理解任务上达到最优性能,并以更低计算成本实现了与更大模型相当的界面生成质量。
English: This paper introduces UI-UG, a unified Multimodal Large Language Model that integrates UI understanding and generation, achieving state-of-the-art performance in understanding tasks and comparable generation quality to larger models with significantly lower computational cost.

Authors:Jie Ma, Shihao Qi, Rui Xing, Ziang Yin, Bifan Wei, Jun Liu, Tongliang Liu
Title: From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision
Abstract:
The quality of process data plays a key role in training a Process Reward Model (PRM), which can enhance the complex mathematical reasoning capability of large language models. Existing methods estimate the quality of reasoning steps based on a fixed-budget sampling strategy and navigate a vast search space to perform path expansion during the automated data generation process, resulting in their inefficiency and inflexibility. To address these issues, we propose Adaptive Monte Carlo Search (AMCS), a framework that transforms data generation from fixed, static to adaptive, dynamic search at the level of node value estimation and path expansion. On one hand, AMCS adaptively refines estimation by allocating more samples to uncertain reasoning steps while using fewer samples for those that are easier to estimate. On the other hand, it enhances the path expansion through a Monte Carlo algorithm with a temporally adaptive policy that begins with broad exploration and gradually shifts toward exploiting the most promising directions. With AMCS, we construct a large-scale dataset MathSearch-200K of about 200K process supervision examples for training PRMs. To verify the effectiveness of our method, we conduct extensive experiments on four mathematical reasoning benchmarks. Experimental results show that Qwen2.5-Math-7B-PRM-AMCS achieves up to 76.2% accuracy on MATH500 with GLM-4-9B, outperforming all baseline PRMs. Notably, a 7B model supervised by Qwen2.5-Math-7B-PRM-AMCS surpasses a 72B model with weaker supervision. Moreover, Qwen2.5-Math-7B-PRM-AMCS maintains consistent advantages on out-of-distribution problems, demonstrating strong generalization capability. Our code is available at https://github.com/reml-group/AMCS.
中文: 本文提出自适应蒙特卡洛搜索(AMCS)框架,通过动态调整节点评估和路径扩展策略,显著提升了过程监督数据的生成效率,在数学推理基准测试中全面超越现有方法并展现出卓越的泛化能力。
English: This paper introduces Adaptive Monte Carlo Search (AMCS), a dynamic framework that enhances the efficiency and flexibility of generating process supervision data for training Process Reward Models, leading to superior performance on mathematical reasoning benchmarks compared to existing methods.

Authors:Sarmistha Das, Priya Mathur, Ishani Sharma, Sriparna Saha, Kitsuchart Pasupa, Alka Maurya
Title: Fin-Ally: Pioneering the Development of an Advanced, Commonsense-Embedded Conversational AI for Money Matters
Abstract:
The exponential technological breakthrough of the FinTech industry has significantly enhanced user engagement through sophisticated advisory chatbots. However, large-scale fine-tuning of LLMs can occasionally yield unprofessional or flippant remarks, such as ``With that money, you're going to change the world,'' which, though factually correct, can be contextually inappropriate and erode user trust. The scarcity of domain-specific datasets has led previous studies to focus on isolated components, such as reasoning-aware frameworks or the enhancement of human-like response generation. To address this research gap, we present Fin-Solution 2.O, an advanced solution that 1) introduces the multi-turn financial conversational dataset, Fin-Vault, and 2) incorporates a unified model, Fin-Ally, which integrates commonsense reasoning, politeness, and human-like conversational dynamics. Fin-Ally is powered by COMET-BART-embedded commonsense context and optimized with a Direct Preference Optimization (DPO) mechanism to generate human-aligned responses. The novel Fin-Vault dataset, consisting of 1,417 annotated multi-turn dialogues, enables Fin-Ally to extend beyond basic account management to provide personalized budgeting, real-time expense tracking, and automated financial planning. Our comprehensive results demonstrate that incorporating commonsense context enables language models to generate more refined, textually precise, and professionally grounded financial guidance, positioning this approach as a next-generation AI solution for the FinTech sector. Dataset and codes are available at: https://github.com/sarmistha-D/Fin-Ally
中文摘要:Fin-Solution 2.O通过引入Fin-Vault多轮对话数据集和集成常识推理的Fin-Ally统一模型,解决了金融聊天机器人因领域数据稀缺导致回复不专业的问题,可生成更精准且符合人类偏好的财务指导。
English Summary: Fin-Solution 2.O introduces the Fin-Vault dataset and Fin-Ally model to address the challenge of generating contextually appropriate and professional financial advice by integrating commonsense reasoning and human-like conversational dynamics.

Authors:Mengyu Bu, Shaolei Zhang, Zhongjun He, Hua Wu, Yang Feng
Title: AlignX: Advancing Multilingual Large Language Models with Multilingual Representation Alignment
Abstract:
Multilingual large language models (LLMs) possess impressive multilingual understanding and generation capabilities. However, their performance and cross-lingual alignment often lag for non-dominant languages. A common solution is to fine-tune LLMs on large-scale and more balanced multilingual corpus, but such approaches often lead to imprecise alignment and suboptimal knowledge transfer, struggling with limited improvements across languages. In this paper, we propose AlignX to bridge the multilingual performance gap, which is a two-stage representation-level framework for enhancing multilingual performance of pre-trained LLMs. In the first stage, we align multilingual representations with multilingual semantic alignment and language feature integration. In the second stage, we stimulate the multilingual capability of LLMs via multilingual instruction fine-tuning. Experimental results on several pre-trained LLMs demonstrate that our approach enhances LLMs' multilingual general and cross-lingual generation capability. Further analysis indicates that AlignX brings the multilingual representations closer and improves the cross-lingual alignment.
Chinese: AlignX框架通过语义对齐和指令微调两阶段方法,有效提升多语言大模型的跨语言生成能力与表征一致性,缩小非主流语言的性能差距。
English: The proposed AlignX framework enhances multilingual LLMs by aligning representations through semantic alignment and instruction fine-tuning, improving cross-lingual performance and closing the performance gap for non-dominant languages.

Authors:Wankun Chen, Feng Gao, Yanhai Gan, Jingchao Cao, Junyu Dong, Qian Du
Title: Wavelet-Assisted Mamba for Satellite-Derived Sea Surface Temperature Super-Resolution
Abstract:
Sea surface temperature (SST) is an essential indicator of global climate change and one of the most intuitive factors reflecting ocean conditions. Obtaining high-resolution SST data remains challenging due to limitations in physical imaging, and super-resolution via deep neural networks is a promising solution. Recently, Mamba-based approaches leveraging State Space Models (SSM) have demonstrated significant potential for long-range dependency modeling with linear complexity. However, their application to SST data super-resolution remains largely unexplored. To this end, we propose the Wavelet-assisted Mamba Super-Resolution (WMSR) framework for satellite-derived SST data. The WMSR includes two key components: the Low-Frequency State Space Module (LFSSM) and High-Frequency Enhancement Module (HFEM). The LFSSM uses 2D-SSM to capture global information of the input data, and the robust global modeling capabilities of SSM are exploited to preserve the critical temperature information in the low-frequency component. The HFEM employs the pixel difference convolution to match and correct the high-frequency feature, achieving accurate and clear textures. Through comprehensive experiments on three SST datasets, our WMSR demonstrated superior performance over state-of-the-art methods. Our codes and datasets will be made publicly available at https://github.com/oucailab/WMSR.
中文: 本文提出了基于小波辅助的Mamba超分辨率(WMSR)框架,利用状态空间模型和小波处理提升海表温度数据分辨率,在实验中展现出优于现有方法的性能。
English: This paper introduces the Wavelet-assisted Mamba Super-Resolution (WMSR) framework, which utilizes state space models and wavelet processing to enhance the resolution of sea surface temperature data, demonstrating superior performance in experiments.

Authors:Kun Wang, Guibin Zhang, ManKit Ye, Xinyu Deng, Dongxia Wang, Xiaobin Hu, Jinyang Guo, Yang Liu, Yufei Guo
Title: MAS$^2$: Self-Generative, Self-Configuring, Self-Rectifying Multi-Agent Systems
Abstract:
The past two years have witnessed the meteoric rise of Large Language Model (LLM)-powered multi-agent systems (MAS), which harness collective intelligence and exhibit a remarkable trajectory toward self-evolution. This paradigm has rapidly progressed from manually engineered systems that require bespoke configuration of prompts, tools, roles, and communication protocols toward frameworks capable of automated orchestration. Yet, dominant automatic multi-agent systems, whether generated by external modules or a single LLM agent, largely adhere to a rigid ``\textit{generate-once-and-deploy}'' paradigm, rendering the resulting systems brittle and ill-prepared for the dynamism and uncertainty of real-world environments. To transcend this limitation, we introduce MAS$^2$, a paradigm predicated on the principle of recursive self-generation: a multi-agent system that autonomously architects bespoke multi-agent systems for diverse problems. Technically, we devise a ``\textit{generator-implementer-rectifier}'' tri-agent team capable of dynamically composing and adaptively rectifying a target agent system in response to real-time task demands. Collaborative Tree Optimization is proposed to train and specialize these meta-agents. Extensive evaluation across seven benchmarks reveals that MAS$^2$ achieves performance gains of up to $19.6\%$ over state-of-the-art MAS in complex scenarios such as deep research and code generation. Moreover, MAS$^2$ exhibits superior cross-backbone generalization, effectively leveraging previously unseen LLMs to yield improvements of up to $15.1\%$. Crucially, these gains are attained without incurring excessive token costs, as MAS$^2$ consistently resides on the Pareto frontier of cost-performance trade-offs. The source codes are available at https://github.com/yeyeyeah2/MAS2.
中文摘要:MAS$^2$框架提出递归自生成范式,通过生成器-执行器-修正器三元智能体团队动态构建并自适应调整多智能体系统,在复杂场景中实现高达19.6%的性能提升,同时保持最优的效能成本比。
English Summary: The MAS$^2$ framework introduces a recursive self-generation paradigm where a tri-agent team dynamically composes and adaptively rectifies multi-agent systems, achieving up to 19.6% performance gains in complex scenarios while maintaining optimal cost-efficiency.

Authors:Yuntao Shou, Tao Meng, Wei Ai, Keqin Li
Title: Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey
Abstract:
In recent years, large language models (LLMs) have driven major advances in language understanding, marking a significant step toward artificial general intelligence (AGI). With increasing demands for higher-level semantics and cross-modal fusion, multimodal large language models (MLLMs) have emerged, integrating diverse information sources (e.g., text, vision, and audio) to enhance modeling and reasoning in complex scenarios. In AI for Science, multimodal emotion recognition and reasoning has become a rapidly growing frontier. While LLMs and MLLMs have achieved notable progress in this area, the field still lacks a systematic review that consolidates recent developments. To address this gap, this paper provides a comprehensive survey of LLMs and MLLMs for emotion recognition and reasoning, covering model architectures, datasets, and performance benchmarks. We further highlight key challenges and outline future research directions, aiming to offer researchers both an authoritative reference and practical insights for advancing this domain. To the best of our knowledge, this paper is the first attempt to comprehensively survey the intersection of MLLMs with multimodal emotion recognition and reasoning. The summary of existing methods mentioned is in our Github: \href{https://github.com/yuntaoshou/Awesome-Emotion-Reasoning}{https://github.com/yuntaoshou/Awesome-Emotion-Reasoning}.
中文: 本文首次系统综述了用于多模态情感识别与推理的大语言模型及多模态大语言模型,涵盖架构、数据集与性能基准,并指出了关键挑战与未来研究方向。
English: This paper provides the first comprehensive survey of LLMs and MLLMs for multimodal emotion recognition and reasoning, covering architectures, datasets, and benchmarks while identifying key challenges and future directions.

Authors:Dipan Maity
Title: AuON: A Linear-time Alternative to Semi-Orthogonal Momentum Updates
Abstract:
Orthogonal gradient updates have emerged as a promising direction in optimization for machine learning. However, traditional approaches such as SVD/QR decomposition incur prohibitive computational costs of O(n^3) and underperform compared to well-tuned SGD with momentum, since momentum is applied only after strict orthogonalization. Recent advances, such as Muon, improve efficiency by applying momentum before orthogonalization and producing semi-orthogonal matrices via Newton-Schulz iterations, reducing complexity to O(n^2). Nevertheless, quadratic costs remain a bottleneck. In this work, we study the semi-orthogonal properties of momentum-based updates and develop a method to bound momentum updates under a spectral-norm trust region, preserving directional information without requiring explicit semi-orthogonalization. We propose AuON (Alternative Unit-norm momentum updates by Normalized nonlinear scaling), a linear-time optimizer that achieves strong performance without constructing semi-orthogonal matrices, while preserving structural alignment and reconditioning ill-posed updates. Our approach combines hyperbolic-cosine RMS scaling transformations with normalization, demonstrating both effectiveness and computational efficiency compared to Newton-Schulz methods. We further introduce a hybrid variant (Hybrid-AuON) that applies a single Newton-Schulz iteration. Experiments across vision and language benchmarks show that AuON and its hybrid variant achieve performance comparable to strong baselines such as AdamW and Muon. Code is available at: https://github.com/ryyzn9/AuON
中文摘要:AuON是一种线性时间优化器,通过归一化非线性缩放保持动量更新的方向对齐而无需构建半正交矩阵,在视觉和语言任务中实现了与AdamW和Muon相媲美的性能。
English Summary: AuON is a linear-time optimizer that uses normalized nonlinear scaling to preserve directional alignment in momentum updates without costly semi-orthogonal matrix construction, achieving competitive performance with AdamW and Muon across vision and language tasks.

Authors:Korbinian Moller, Roland Stroop, Mattia Piccinini, Alexander Langmann, Johannes Betz
Title: Learning to Sample: Reinforcement Learning-Guided Sampling for Autonomous Vehicle Motion Planning
Abstract:
Sampling-based motion planning is a well-established approach in autonomous driving, valued for its modularity and analytical tractability. In complex urban scenarios, however, uniform or heuristic sampling often produces many infeasible or irrelevant trajectories. We address this limitation with a hybrid framework that learns where to sample while keeping trajectory generation and evaluation fully analytical and verifiable. A reinforcement learning (RL) agent guides the sampling process toward regions of the action space likely to yield feasible trajectories, while evaluation and final selection remains governed by deterministic feasibility checks and cost functions. We couple the RL sampler with a world model (WM) based on a decodable deep set encoder, enabling both variable numbers of traffic participants and reconstructable latent representations. The approach is evaluated in the CommonRoad simulation environment, showing up to 99% fewer required samples and a runtime reduction of up to 84% while maintaining planning quality in terms of success and collision-free rates. These improvements lead to faster, more reliable decision-making for autonomous vehicles in urban environments, achieving safer and more responsive navigation under real-world constraints. Code and trained artifacts are publicly available at: https://github.com/TUM-AVS/Learning-to-Sample
中文摘要:该混合框架通过强化学习引导自动驾驶的轨迹采样,在保持规划质量的同时将所需样本减少高达99%,运行时间降低84%,并通过确定性验证确保安全性。
English Summary: The proposed hybrid framework integrates reinforcement learning to guide trajectory sampling in autonomous driving, significantly reducing required samples and runtime by up to 84% while maintaining planning quality through analytical verification.

Authors:Zherui Li, Zheng Nie, Zhenhong Zhou, Yufei Guo, Yue Liu, Yitong Zhang, Yu Cheng, Qingsong Wen, Kun Wang, Jiaheng Zhang
Title: DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Abstract:
The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency. Our code is available at: https://github.com/niez233/DiffuGuard.
中文:扩散大语言模型因其迭代生成机制存在独特的越狱攻击漏洞,为此提出的无需训练的防御框架DiffuGuard能显著降低攻击成功率,同时保持模型性能。
English: Diffusion Large Language Models (dLLMs) exhibit unique vulnerabilities to jailbreak attacks due to their iterative generation process, prompting the development of DiffuGuard, a training-free defense that significantly reduces attack success rates while maintaining model performance.

Authors:An Dao, Vu Tran, Le-Minh Nguyen, Yuji Matsumoto
Title: Overview of SCIDOCA 2025 Shared Task on Citation Prediction, Discovery, and Placement
Abstract:
We present an overview of the SCIDOCA 2025 Shared Task, which focuses on citation discovery and prediction in scientific documents. The task is divided into three subtasks: (1) Citation Discovery, where systems must identify relevant references for a given paragraph; (2) Masked Citation Prediction, which requires selecting the correct citation for masked citation slots; and (3) Citation Sentence Prediction, where systems must determine the correct reference for each cited sentence. We release a large-scale dataset constructed from the Semantic Scholar Open Research Corpus (S2ORC), containing over 60,000 annotated paragraphs and a curated reference set. The test set consists of 1,000 paragraphs from distinct papers, each annotated with ground-truth citations and distractor candidates. A total of seven teams registered, with three submitting results. We report performance metrics across all subtasks and analyze the effectiveness of submitted systems. This shared task provides a new benchmark for evaluating citation modeling and encourages future research in scientific document understanding. The dataset and task materials are publicly available at https://github.com/daotuanan/scidoca2025-shared-task.
中文摘要:SCIDOCA 2025共享任务通过基于S2ORC构建的大规模数据集,设置了三个引文处理子任务作为引文建模新基准,共有七支团队参与,其公开成果将推动科学文献理解研究。
English Summary: The SCIDOCA 2025 Shared Task introduces a new benchmark for citation modeling through three subtasks using a large-scale dataset from S2ORC, with seven teams participating and results publicly available for advancing scientific document understanding.

Authors:Nimisha Ghosh, Dheeran Sankaran, Rahul Balakrishnan Adhi, Sharath S, Amrut Anand
Title: LAMP-PRo: Label-aware Attention for Multi-label Prediction of DNA- and RNA-binding Proteins using Protein Language Models
Abstract:
Identifying DNA- (DBPs) and RNA-binding proteins (RBPs) is crucial for the understanding of cell function, molecular interactions as well as regulatory functions. Owing to their high similarity, most of the existing approaches face challenges in differentiating between DBPs and RBPs leading to high cross-prediction errors. Moreover, identifying proteins which bind to both DNA and RNA (DRBPs) is also quite a challenging task. In this regard, we propose a novel framework viz. LAMP-PRo which is based on pre-trained protein language model (PLM), attention mechanisms and multi-label learning to mitigate these issues. First, pre-trained PLM such ESM-2 is used for embedding the protein sequences followed by convolutional neural network (CNN). Subsequently multi-head self-attention mechanism is applied for the contextual information while label-aware attention is used to compute class-specific representations by attending to the sequence in a way that is tailored to each label (DBP, RBP and non-NABP) in a multi-label setup. We have also included a novel cross-label attention mechanism to explicitly capture dependencies between DNA- and RNA-binding proteins, enabling more accurate prediction of DRBP. Finally, a linear layer followed by a sigmoid function are used for the final prediction. Extensive experiments are carried out to compare LAMP-PRo with the existing methods wherein the proposed model shows consistent competent performance. Furthermore, we also provide visualization to showcase model interpretability, highlighting which parts of the sequence are most relevant for a predicted label. The original datasets are available at http://bliulab.net/iDRBP\_MMC and the codes are available at https://github.com/NimishaGhosh/LAMP-PRo.
中文: LAMP-PRo框架通过预训练蛋白质语言模型、注意力机制和多标签学习,能准确区分DNA和RNA结合蛋白,并有效识别双重结合蛋白。
English: The proposed LAMP-PRo framework utilizes pre-trained protein language models, attention mechanisms, and multi-label learning to accurately differentiate between DNA- and RNA-binding proteins while effectively identifying dual-binding proteins.

Authors:Rubing Yang, Huajun Bai, Song Liu, Guanghua Yu, Runzhi Fan, Yanbin Dang, Jiejing Zhang, Kai Liu, Jianchen Zhu, Peng Chen
Title: SpecExit: Accelerating Large Reasoning Model via Speculative Exit
Abstract:
Despite their strong performance on reasoning tasks, large reasoning models (LRMs) often suffer from overthinking, producing unnecessarily long outputs and incurring high end-to-end latency, a significant limitation to their real-world deployment. To address overthinking, early-exit mechanisms have been proposed to terminate reasoning before typical completion, showing that this approach can effectively shorten generation length with minimal impact on accuracy. However, their reliance on probing mechanisms introduces a detection overhead that limits their end-to-end latency gains and compromises their generalizability across diverse problems. Inspired by the use of hidden states in speculative decoding, we propose SpecExit, a novel framework that predicts both future tokens and an early-exit signal directly from a lightweight draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66\% and achieving a 2.5x speedup in end-to-end latency compared to the speculative decoding baseline, without compromising accuracy. Our method leverages the inherent signals from hidden states to provide effective early-exit signals, suggesting broader use of hidden states for efficient reasoning. Our code is available at https://github.com/Tencent/AngelSlim.
Chinese: SpecExit 是一种新颖框架,利用轻量级草稿模型预测令牌和提前退出信号,在不损失准确性的前提下将生成长度减少 66%,实现 2.5 倍加速。
English: SpecExit is a novel framework that uses a lightweight draft model to predict tokens and early-exit signals, reducing generation length by 66% and achieving 2.5x speedup without accuracy loss.

Authors:Siyan Dong, Zijun Wang, Lulu Cai, Yi Ma, Yanchao Yang
Title: PROFusion: Robust and Accurate Dense Reconstruction via Camera Pose Regression and Optimization
Abstract:
Real-time dense scene reconstruction during unstable camera motions is crucial for robotics, yet current RGB-D SLAM systems fail when cameras experience large viewpoint changes, fast motions, or sudden shaking. Classical optimization-based methods deliver high accuracy but fail with poor initialization during large motions, while learning-based approaches provide robustness but lack sufficient accuracy for dense reconstruction. We address this challenge through a combination of learning-based initialization with optimization-based refinement. Our method employs a camera pose regression network to predict metric-aware relative poses from consecutive RGB-D frames, which serve as reliable starting points for a randomized optimization algorithm that further aligns depth images with the scene geometry. Extensive experiments demonstrate promising results: our approach outperforms the best competitor on challenging benchmarks, while maintaining comparable accuracy on stable motion sequences. The system operates in real-time, showcasing that combining simple and principled techniques can achieve both robustness for unstable motions and accuracy for dense reconstruction. Project page: https://github.com/siyandong/PROFusion.
中文: 我们的方法结合了基于学习的位姿初始化和基于优化的精细化处理,在相机运动不稳定的情况下实现了鲁棒的实时稠密场景重建,在挑战性场景中优于竞争对手,同时在稳定序列中保持精度。
English: Our method combines learning-based pose initialization with optimization-based refinement to achieve robust real-time dense scene reconstruction under unstable camera motions, outperforming competitors in challenging scenarios while maintaining accuracy in stable sequences.

Authors:Junjie Wang, Pan Zhou, Yiming Dong, Huan Li, Jia Li, Xun Zhou, Qicheng Lao, Cong Fang, Zhouchen Lin
Title: Conda: Column-Normalized Adam for Training Large Language Models Faster
Abstract:
Large language models (LLMs) have demonstrated impressive generalization and emergent capabilities, yet their pre-training remains computationally expensive and sensitive to optimization dynamics. While Adam-based optimizers offer fast convergence by adapting learning rates coordinate-wise, recent studies reveal that their updates often suffer from poor spectral conditioning and low-rank structures, hindering efficiency. Muon addresses this issue via global spectral normalization but lacks the per-coordinate adaptivity of Adam. In this work, we propose Column-Normalized Adam (Conda), a novel optimizer that bridges the strengths of both approaches. Conda projects updates into an orthogonal subspace and applies column-wise second moment normalization based on the projected gradients, thereby achieving both improved spectral conditioning and maintaining coordinate-wise adaptivity. This design alleviates the spectral pathologies of Adam while preserving its fast convergence behavior. Extensive experiments on the LLaMA and GPT-2 series show that Conda consistently outperforms AdamW, Muon, and other baselines in pre-training. Remarkably, on the LLaMA series, Conda achieves 2-2.5 the convergence speed of AdamW, measured in both training steps and training time. Further ablations demonstrate its robustness under diverse training setups. These results collectively highlight Conda as an effective and broadly applicable optimizer for large-scale LLM training. The code is released on https://github.com/jie040109/Conda
Chinese Summary: 本文提出列归一化Adam优化器,通过正交子空间投影和列向二阶矩归一化,在保持坐标自适应性的同时改善谱条件,在LLaMA预训练中实现比AdamW快2-2.5倍的收敛速度。
English Summary: The paper introduces Column-Normalized Adam (Conda), a novel optimizer that combines improved spectral conditioning with coordinate-wise adaptivity, achieving 2-2.5 times faster convergence than AdamW in LLaMA pre-training experiments.

Authors:Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, Bolin Ding
Title: Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Abstract:
Off-policy reinforcement learning (RL) for large language models (LLMs) is attracting growing interest, driven by practical constraints in real-world applications, the complexity of LLM-RL infrastructure, and the need for further innovations of RL methodologies. While classic REINFORCE and its modern variants like Group Relative Policy Optimization (GRPO) are typically regarded as on-policy algorithms with limited tolerance of off-policyness, we present in this work a first-principles derivation for group-relative REINFORCE without assuming a specific training data distribution, showing that it admits a native off-policy interpretation. This perspective yields two general principles for adapting REINFORCE to off-policy settings: regularizing policy updates, and actively shaping the data distribution. Our analysis demystifies some myths about the roles of importance sampling and clipping in GRPO, unifies and reinterprets two recent algorithms -- Online Policy Mirror Descent (OPMD) and Asymmetric REINFORCE (AsymRE) -- as regularized forms of the REINFORCE loss, and offers theoretical justification for seemingly heuristic data-weighting strategies. Our findings lead to actionable insights that are validated with extensive empirical studies, and open up new opportunities for principled algorithm design in off-policy RL for LLMs. Source code for this work is available at https://github.com/modelscope/Trinity-RFT/tree/main/examples/rec_gsm8k.
Chinese: 本研究从第一性原理推导了群体相对REINFORCE算法,揭示了其天然的离策略特性,提出了适用于离策略场景的两大改进原则,统一了近期相关算法框架,并为大语言模型的强化学习提供了经实证验证的设计思路。
English: This work provides a first-principles derivation of group-relative REINFORCE, demonstrating its native off-policy capability and establishing two principles for adapting REINFORCE to off-policy settings, which unify recent algorithms and offer validated insights for LLM reinforcement learning.

Authors:Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, Hao Tang
Title: UniVid: The Open-Source Unified Video Model
Abstract:
Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: maintaining semantic faithfulness during flow-based generation due to text-visual token imbalance and the limitations of uniform cross-modal attention across the flow trajectory, and efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines. Code: https://github.com/AIGeeksGroup/UniVid. Website: https://aigeeksgroup.github.io/UniVid.
中文: UniVid提出了一种统一视频架构,通过轻量级适配器将多模态大语言模型与扩散解码器结合,支持视频生成与理解,并采用温度模态对齐和金字塔反射等技术解决语义忠实度和高效时序推理问题,在基准测试中实现了领先性能。
English: UniVid introduces a unified video architecture that integrates an MLLM with a diffusion decoder via a lightweight adapter, enabling both video generation and understanding while addressing challenges like semantic faithfulness and efficient temporal reasoning through innovative techniques such as Temperature Modality Alignment and Pyramid Reflection, achieving state-of-the-art performance on benchmarks.

Authors:Ran Xu, Yuchen Zhuang, Zihan Dong, Jonathan Wang, Yue Yu, Joyce C. Ho, Linjun Zhang, Haoyu Wang, Wenqi Shi, Carl Yang
Title: AceSearcher: Bootstrapping Reasoning and Search for LLMs via Reinforced Self-Play
Abstract:
Search-augmented LLMs often struggle with complex reasoning tasks due to ineffective multi-hop retrieval and limited reasoning ability. We propose AceSearcher, a cooperative self-play framework that trains a single large language model (LLM) to alternate between two roles: a decomposer that breaks down complex queries and a solver that integrates retrieved contexts for answer generation. AceSearcher couples supervised fine-tuning on a diverse mixture of search, reasoning, and decomposition tasks with reinforcement fine-tuning optimized for final answer accuracy, eliminating the need for intermediate annotations. Extensive experiments on three reasoning-intensive tasks across 10 datasets show that AceSearcher outperforms state-of-the-art baselines, achieving an average exact match improvement of 7.6%. Remarkably, on document-level finance reasoning tasks, AceSearcher-32B matches the performance of the DeepSeek-V3 model using less than 5% of its parameters. Even at smaller scales (1.5B and 8B), AceSearcher often surpasses existing search-augmented LLMs with up to 9x more parameters, highlighting its exceptional efficiency and effectiveness in tackling complex reasoning tasks. Our code will be published at https://github.com/ritaranx/AceSearcher and https://huggingface.co/AceSearcher.
中文: AceSearcher是一种协作自博弈框架,通过训练单一大型语言模型交替担任分解复杂查询和整合检索信息生成答案的角色,无需中间标注即可在复杂推理任务中实现卓越性能与效率。
English: AceSearcher is a cooperative self-play framework that trains a single LLM to alternate between decomposing complex queries and solving them with retrieved contexts, achieving superior performance and efficiency in complex reasoning tasks without intermediate annotations.

Authors:Le Dong, Jinghao Bian, Jingyang Hou, Jingliang Hu, Yilei Shi, Weisheng Dong, Xiao Xiang Zhu, Lichao Mou
Title: High-Order Progressive Trajectory Matching for Medical Image Dataset Distillation
Abstract:
Medical image analysis faces significant challenges in data sharing due to privacy regulations and complex institutional protocols. Dataset distillation offers a solution to address these challenges by synthesizing compact datasets that capture essential information from real, large medical datasets. Trajectory matching has emerged as a promising methodology for dataset distillation; however, existing methods primarily focus on terminal states, overlooking crucial information in intermediate optimization states. We address this limitation by proposing a shape-wise potential that captures the geometric structure of parameter trajectories, and an easy-to-complex matching strategy that progressively addresses parameters based on their complexity. Experiments on medical image classification tasks demonstrate that our method improves distillation performance while preserving privacy and maintaining model accuracy comparable to training on the original datasets. Our code is available at https://github.com/Bian-jh/HoP-TM.
Chinese Summary: 本研究提出一种新颖的医学影像数据集蒸馏方法,通过形状势能和由易到难的匹配策略捕捉中间优化状态,在保护隐私的同时提升蒸馏性能,并保持与原始数据集相当的模型精度。
English Summary: This study introduces a novel dataset distillation method for medical imaging that utilizes shape-wise potential and easy-to-complex matching to capture intermediate optimization states, enhancing performance while preserving privacy and maintaining accuracy comparable to original datasets.

Authors:Jun-Hao Wang, Yi-Yang Tian, Baoquan Chen, Peng-Shuai Wang
Title: Neural Visibility of Point Sets
Abstract:
Point clouds are widely used representations of 3D data, but determining the visibility of points from a given viewpoint remains a challenging problem due to their sparse nature and lack of explicit connectivity. Traditional methods, such as Hidden Point Removal (HPR), face limitations in computational efficiency, robustness to noise, and handling concave regions or low-density point clouds. In this paper, we propose a novel approach to visibility determination in point clouds by formulating it as a binary classification task. The core of our network consists of a 3D U-Net that extracts view-independent point-wise features and a shared multi-layer perceptron (MLP) that predicts point visibility using the extracted features and view direction as inputs. The network is trained end-to-end with ground-truth visibility labels generated from rendered 3D models. Our method significantly outperforms HPR in both accuracy and computational efficiency, achieving up to 126 times speedup on large point clouds. Additionally, our network demonstrates robustness to noise and varying point cloud densities and generalizes well to unseen shapes. We validate the effectiveness of our approach through extensive experiments on the ShapeNet, ABC Dataset and real-world datasets, showing substantial improvements in visibility accuracy. We also demonstrate the versatility of our method in various applications, including point cloud visualization, surface reconstruction, normal estimation, shadow rendering, and viewpoint optimization. Our code and models are available at https://github.com/octree-nn/neural-visibility.
中文: 本文提出了一种将点云可见性视为二分类任务的神经网络方法,相比传统方法在精度和计算效率上显著提升,并在多种数据集和应用中展现出良好的鲁棒性。
English: This paper introduces a neural network-based method that treats point cloud visibility as a binary classification task, achieving superior accuracy and computational efficiency compared to traditional approaches while demonstrating robustness across various datasets and applications.

Authors:Deepak Prakash Kumar, Swaroop Darbha, Satyanarayana Gupta Manyam, David Casbeer
Title: A Novel Model for 3D Motion Planning for a Generalized Dubins Vehicle with Pitch and Yaw Rate Constraints
Abstract:
In this paper, we propose a new modeling approach and a fast algorithm for 3D motion planning, applicable for fixed-wing unmanned aerial vehicles. The goal is to construct the shortest path connecting given initial and final configurations subject to motion constraints. Our work differs from existing literature in two ways. First, we consider full vehicle orientation using a body-attached frame, which includes roll, pitch, and yaw angles. However, existing work uses only pitch and/or heading angle, which is insufficient to uniquely determine orientation. Second, we use two control inputs to represent bounded pitch and yaw rates, reflecting control by two separate actuators. In contrast, most previous methods rely on a single input, such as path curvature, which is insufficient for accurately modeling the vehicle's kinematics in 3D. We use a rotation minimizing frame to describe the vehicle's configuration and its evolution, and construct paths by concatenating optimal Dubins paths on spherical, cylindrical, or planar surfaces. Numerical simulations show our approach generates feasible paths within 10 seconds on average and yields shorter paths than existing methods in most cases.
中文: 本文提出了一种针对固定翼无人机的三维运动规划新模型和快速算法,通过双控制输入实现全姿态控制,能在多数情况下生成更短且可行的路径。
English: This paper introduces a novel 3D motion planning model and fast algorithm for fixed-wing UAVs, incorporating full orientation control with dual inputs to generate shorter, feasible paths efficiently.

Authors:Jianze Li, Yong Guo, Yulun Zhang, Xiaokang Yang
Title: Asymmetric VAE for One-Step Video Super-Resolution Acceleration
Abstract:
Diffusion models have significant advantages in the field of real-world video super-resolution and have demonstrated strong performance in past research. In recent diffusion-based video super-resolution (VSR) models, the number of sampling steps has been reduced to just one, yet there remains significant room for further optimization in inference efficiency. In this paper, we propose FastVSR, which achieves substantial reductions in computational cost by implementing a high compression VAE (spatial compression ratio of 16, denoted as f16). We design the structure of the f16 VAE and introduce a stable training framework. We employ pixel shuffle and channel replication to achieve additional upsampling. Furthermore, we propose a lower-bound-guided training strategy, which introduces a simpler training objective as a lower bound for the VAE's performance. It makes the training process more stable and easier to converge. Experimental results show that FastVSR achieves speedups of 111.9 times compared to multi-step models and 3.92 times compared to existing one-step models. We will release code and models at https://github.com/JianzeLi-114/FastVSR.
中文:FastVSR通过采用高压缩比VAE和稳定的训练框架,显著提升了视频超分辨率的推理效率,相比现有模型实现了大幅加速。
English: FastVSR significantly enhances inference efficiency in video super-resolution by employing a highly compressed VAE and a stable training framework, achieving substantial speed improvements over existing models.

Authors:Md Mozaharul Mottalib, Thao-Ly T. Phan, Rahmatollah Beheshti
Title: HyMaTE: A Hybrid Mamba and Transformer Model for EHR Representation Learning
Abstract:
Electronic health Records (EHRs) have become a cornerstone in modern-day healthcare. They are a crucial part for analyzing the progression of patient health; however, their complexity, characterized by long, multivariate sequences, sparsity, and missing values poses significant challenges in traditional deep learning modeling. While Transformer-based models have demonstrated success in modeling EHR data and predicting clinical outcomes, their quadratic computational complexity and limited context length hinder their efficiency and practical applications. On the other hand, State Space Models (SSMs) like Mamba present a promising alternative offering linear-time sequence modeling and improved efficiency for handling long sequences, but focus mostly on mixing sequence-level information rather than channel-level data. To overcome these challenges, we propose HyMaTE (A Hybrid Mamba and Transformer Model for EHR Representation Learning), a novel hybrid model tailored for representing longitudinal data, combining the strengths of SSMs with advanced attention mechanisms. By testing the model on predictive tasks on multiple clinical datasets, we demonstrate HyMaTE's ability to capture an effective, richer, and more nuanced unified representation of EHR data. Additionally, the interpretability of the outcomes achieved by self-attention illustrates the effectiveness of our model as a scalable and generalizable solution for real-world healthcare applications. Codes are available at: https://github.com/healthylaife/HyMaTE.
中文:HyMaTE模型融合状态空间模型与Transformer注意力机制,能高效学习复杂电子健康记录中的细微特征,在临床预测任务中展现出优越性能和可解释性。
English: The HyMaTE model combines State Space Models and Transformer attention to efficiently learn nuanced representations from complex Electronic Health Records, demonstrating superior performance and interpretability in clinical predictions.

Authors:Li Zhang, Haoxiang Gao, Zhihao Zhang, Luoxiao Huang, Tao Zhang
Title: SVAC: Scaling Is All You Need For Referring Video Object Segmentation
Abstract:
Referring Video Object Segmentation (RVOS) aims to segment target objects in video sequences based on natural language descriptions. While recent advances in Multi-modal Large Language Models (MLLMs) have improved RVOS performance through enhanced text-video understanding, several challenges remain, including insufficient exploitation of MLLMs' prior knowledge, prohibitive computational and memory costs for long-duration videos, and inadequate handling of complex temporal dynamics. In this work, we propose SVAC, a unified model that improves RVOS by scaling up input frames and segmentation tokens to enhance video-language interaction and segmentation precision. To address the resulting computational challenges, SVAC incorporates the Anchor-Based Spatio-Temporal Compression (ASTC) module to compress visual tokens while preserving essential spatio-temporal structure. Moreover, the Clip-Specific Allocation (CSA) strategy is introduced to better handle dynamic object behaviors across video clips. Experimental results demonstrate that SVAC achieves state-of-the-art performance on multiple RVOS benchmarks with competitive efficiency. Our code is available at https://github.com/lizhang1998/SVAC.
Chinese: 提出的SVAC模型通过扩展输入帧和分割标记来改进参考视频对象分割,同时采用压缩和分配策略以保持效率并处理动态视频内容,在多个基准测试中实现了最优性能。
English: The proposed SVAC model enhances Referring Video Object Segmentation by scaling frame inputs and segmentation tokens while employing compression and allocation strategies to maintain efficiency and handle dynamic video content, achieving state-of-the-art results on benchmarks.

Authors:Kaiyu He, Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Xinya Du, Zhiyu Chen
Title: GEAR: A General Evaluation Framework for Abductive Reasoning
Abstract:
Since the advent of large language models (LLMs), research has focused on instruction following and deductive reasoning. A central question remains: can these models discover new knowledge, and how can we evaluate this ability? We address this by studying abductive reasoning-the generation of plausible hypotheses to explain observations-and introduce GEAR (General Evaluation for Abductive Reasoning), a general-purpose, fully automated, transparent, and label-free evaluation paradigm. GEAR scores hypothesis sets by three metrics: consistency (each hypothesis explains the observations), generalizability (consistent hypotheses make meaningful predictions on unseen inputs), and diversity (the set covers distinct predictions and patterns). Built this way, GEAR is scalable (no human gold answers), reliable (deterministic scoring aligned with classical abduction), and open-ended (scores improve only when models produce new plausible hypotheses, unlike static benchmarks that saturate once accuracy is high). Using GEAR, we conduct a fine-grained study of nine LLMs on four abduction benchmarks with 1,500 problems, generating over 50,000 candidate hypotheses and revealing model differences obscured by gold-answer or purely human evaluations. We further propose a momentum-based curriculum that adjusts GEAR-derived training data by learning velocity: it starts with what the model learns quickly and shifts toward harder objectives such as generating diverse hypotheses once the model is confident on foundational objectives. Without gold-label supervision, this strategy improves all GEAR objectives and these gains transfer to established abductive reasoning benchmarks. Taken together, GEAR provides a principled framework that evaluates abduction and supplies label-free, scalable training signals that help LLMs produce more diverse and reliable hypotheses.
This research introduces GEAR, a novel evaluation framework for assessing large language models' abductive reasoning ability through automated scoring of hypothesis consistency, generalizability, and diversity, while also proposing a momentum-based curriculum that improves model performance without requiring labeled data.
English Summary:

Authors:Zeqing Wang, Gongfan Fang, Xinyin Ma, Xingyi Yang, Xinchao Wang
Title: SparseD: Sparse Attention for Diffusion Language Models
Abstract:
While diffusion language models (DLMs) offer a promising alternative to autoregressive models (ARs), existing open-source DLMs suffer from high inference latency. This bottleneck is mainly due to the attention's quadratic complexity with respect to context length in computing all query-key pairs. Intuitively, to reduce this complexity, a natural strategy is to restrict attention to sparse patterns that retain only the most relevant connections. Such approaches are well-established in ARs, where attention follows fixed and clearly defined sparse patterns. However, in DLMs, we observe distinct sparsity behaviors: (1) attention patterns vary across heads, (2) attention patterns in each head remain highly similar across denoising steps, and (3) early denoising steps are critical for generation. These findings render sparse attention methods designed for ARs largely incompatible with DLMs, as they fail to capture head-specific structures and risk degrading generation when applied in early denoising steps. To address these challenges, we propose SparseD, a novel sparse attention method for DLMs. Leveraging the observations, SparseD only requires pre-computing head-specific sparse patterns one time, and reuses them across all steps. This prevents recomputing sparse patterns at each denoising step. Meanwhile, SparseD uses full attention in the early steps, then switches to sparse attention later to maintain generation quality. Together, these establish SparseD as a practical and efficient solution for deploying DLMs in long-context applications. Experimental results demonstrate that SparseD achieves lossless acceleration, delivering up to $1.50\times$ speedup over FlashAttention at a 64k context length with 1,024 denoising steps.
中文: 现有稀疏注意力方法因无法适应扩散语言模型的独特稀疏特性而失效,因此提出SparseD方法,通过预计算头部特定模式并在早期步骤保留完整注意力,实现了无损加速效果。
English: Existing sparse attention methods are incompatible with diffusion language models due to their distinct sparsity behaviors, so SparseD is proposed as a novel method that pre-computes head-specific patterns and strategically uses full attention in early steps to achieve lossless acceleration.

Authors:Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang
Title: Sequential Diffusion Language Models
Abstract:
Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and requires expensive training. We introduce Next Sequence Prediction (NSP), which unifies next-token and next-block prediction, enabling the model to adaptively determine the generation length at each step. When the length is fixed to 1, NSP reduces to standard next-token prediction. Building on NSP, we propose Sequential Diffusion Language Model (SDLM), which can retrofit pre-trained autoregressive language models (ALMs) at minimal cost. Specifically, SDLM performs diffusion inference within fixed-size mask blocks, but dynamically decodes consecutive subsequences based on model confidence, thereby preserving KV-cache compatibility and improving robustness to varying uncertainty and semantics across the sequence. Experiments show that SDLM matches or surpasses strong autoregressive baselines using only 3.5M training samples, while achieving 2.1 higher throughput than Qwen-2.5. Notably, the SDLM-32B model delivers even more pronounced efficiency gains, demonstrating the strong scalability potential of our modeling paradigm. Project page and codes: https://github.com/OpenGVLab/SDLM
中文:提出的序列扩散语言模型(SDLM)通过引入下一序列预测机制,在保持KV缓存兼容性的同时实现自适应生成长度,仅需少量训练数据即可超越自回归基线模型并显著提升效率。
English: The proposed Sequential Diffusion Language Model (SDLM) introduces Next Sequence Prediction to enable adaptive generation lengths while maintaining KV-cache compatibility, achieving superior efficiency and performance over autoregressive baselines with minimal training data.

Authors:Matej Palider, Omar Eldardeer, Viktor Kocur
Title: Gaze Estimation for Human-Robot Interaction: Analysis Using the NICO Platform
Abstract:
This paper evaluates the current gaze estimation methods within an HRI context of a shared workspace scenario. We introduce a new, annotated dataset collected with the NICO robotic platform. We evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed in terms of distance in the shared workspace the best median error is 16.48 cm quantifying the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how to best integrate gaze estimation as a modality in HRI systems.
中文: 本文评估了人机交互中的视线估计方法,发现尽管角度误差接近基准水平,但实际应用中最佳中位误差达16.48厘米,揭示了当前方法的局限并提出了改进建议。
English: This paper assesses gaze estimation methods in human-robot interaction, revealing a median error of 16.48 cm in practical applications despite competitive angular accuracy, and suggests improvements for integration.

Authors:Alistair Turcan, Kexin Huang, Lei Li, Martin Jinye Zhang
Title: TusoAI: Agentic Optimization for Scientific Methods
Abstract:
Scientific discovery is often slowed by the manual development of computational tools needed to analyze complex experimental data. Building such tools is costly and time-consuming because scientists must iteratively review literature, test modeling and scientific assumptions against empirical data, and implement these insights into efficient software. Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code, offering new opportunities to accelerate computational method development. Existing LLM-based systems either focus on performing scientific analyses using existing computational methods or on developing computational methods or models for general machine learning without effectively integrating the often unstructured knowledge specific to scientific domains. Here, we introduce TusoAI , an agentic AI system that takes a scientific task description with an evaluation function and autonomously develops and optimizes computational methods for the application. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis, improving performance over a pool of candidate solutions. We conducted comprehensive benchmark evaluations demonstrating that TusoAI outperforms state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks, such as single-cell RNA-seq data denoising and satellite-based earth monitoring. Applying TusoAI to two key open problems in genetics improved existing computational methods and uncovered novel biology, including 9 new associations between autoimmune diseases and T cell subtypes and 7 previously unreported links between disease variants linked to their target genes. Our code is publicly available at https://github.com/Alistair-Turcan/TusoAI.
中文摘要:TusoAI 是一种自主智能系统,通过整合领域知识和迭代优化,自动开发并改进计算方法,在单细胞RNA测序和地球监测等任务中超越现有方法,并成功应用于遗传学难题取得新发现。
English Summary: TusoAI is an autonomous agentic system that accelerates scientific discovery by developing and optimizing computational methods through domain-specific knowledge integration and iterative refinement, outperforming existing approaches in tasks like genetic analysis and earth monitoring.

Authors:Jinpei Guo, Yifei Ji, Zheng Chen, Yufei Wang, Sizhuo Ma, Yong Guo, Yulun Zhang, Jian Wang
Title: Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution
Abstract:
Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient $\textbf{o}$ne-step diffusion model with $\textbf{a}$ttention $\textbf{s}$pecialization for real-world v$\textbf{i}$deo $\textbf{s}$uper-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a $\textbf{6.2$\times$}$ speedup over one-step diffusion baselines such as SeedVR2. The code will be available at \href{https://github.com/jp-guo/OASIS}{https://github.com/jp-guo/OASIS}.
Chinese: 研究者提出OASIS,一种具有注意力专门化的高效一步扩散模型,通过减少计算冗余并优化视频超分辨率性能,在合成和真实数据集上达到最优效果,并实现比基线模型快6.2倍的推理速度。
English: Researchers propose OASIS, an efficient one-step diffusion model with attention specialization that reduces computational redundancy and enhances performance in video super-resolution, achieving state-of-the-art results and a 6.2× speedup over baselines.

Authors:Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lucaz Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyang Peng, Yuanbo Peng, Xiangwei Shen, Yixuan Shi, Jiale Tao, Yangyu Tao, Qi Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyan Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fang Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zheng Yuan, Chao Zhang, Jian-Wei Zhang, Peizhen Zhang, Shi-Xue Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jianchen Zhu, Zhao Zhong
Title: HunyuanImage 3.0 Technical Report
Abstract:
We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
HunyuanImage 3.0 是一个突破性的开源多模态模型,通过自回归框架统一理解与生成功能,拥有超过800亿参数,在文本-图像对齐和视觉质量方面树立了新的标杆。
HunyuanImage 3.0 is a groundbreaking open-source multimodal model that integrates understanding and generation within an autoregressive framework, featuring over 80 billion parameters and setting new standards in text-image alignment and visual quality.

Authors:Surya Murthy, Kushagra Gupta, Mustafa O. Karabag, David Fridovich-Keil, Ufuk Topcu
Title: DiBS-MTL: Transformation-Invariant Multitask Learning with Direction Oracles
Abstract:
Multitask learning (MTL) algorithms typically rely on schemes that combine different task losses or their gradients through weighted averaging. These methods aim to find Pareto stationary points by using heuristics that require access to task loss values, gradients, or both. In doing so, a central challenge arises because task losses can be arbitrarily, nonaffinely scaled relative to one another, causing certain tasks to dominate training and degrade overall performance. A recent advance in cooperative bargaining theory, the Direction-based Bargaining Solution (DiBS), yields Pareto stationary solutions immune to task domination because of its invariance to monotonic nonaffine task loss transformations. However, the convergence behavior of DiBS in nonconvex MTL settings is currently not understood. To this end, we prove that under standard assumptions, a subsequence of DiBS iterates converges to a Pareto stationary point when task losses are possibly nonconvex, and propose DiBS-MTL, a computationally efficient adaptation of DiBS to the MTL setting. Finally, we validate DiBS-MTL empirically on standard MTL benchmarks, showing that it achieves competitive performance with state-of-the-art methods while maintaining robustness to nonaffine monotonic transformations that significantly degrade the performance of existing approaches, including prior bargaining-inspired MTL methods. Code available at https://github.com/suryakmurthy/dibs-mtl.
中文: 基于合作博弈论的DiBS-MTL多任务学习新算法,能保证收敛至帕累托稳定解并在面对非线性任务损失缩放时保持鲁棒性,在基准测试中优于现有方法。
English: DiBS-MTL, a novel multitask learning algorithm based on cooperative bargaining theory, ensures convergence to Pareto stationary solutions and maintains robustness against nonaffine task loss scaling, outperforming existing methods in benchmark tests.

Authors:Dragoş-Andrei Chileban, Andrei-Ştefan Bulzan, Cosmin Cernǎzanu-Glǎvan
Title: CrashSplat: 2D to 3D Vehicle Damage Segmentation in Gaussian Splatting
Abstract:
Automatic car damage detection has been a topic of significant interest for the auto insurance industry as it promises faster, accurate, and cost-effective damage assessments. However, few works have gone beyond 2D image analysis to leverage 3D reconstruction methods, which have the potential to provide a more comprehensive and geometrically accurate representation of the damage. Moreover, recent methods employing 3D representations for novel view synthesis, particularly 3D Gaussian Splatting (3D-GS), have demonstrated the ability to generate accurate and coherent 3D reconstructions from a limited number of views. In this work we introduce an automatic car damage detection pipeline that performs 3D damage segmentation by up-lifting 2D masks. Additionally, we propose a simple yet effective learning-free approach for single-view 3D-GS segmentation. Specifically, Gaussians are projected onto the image plane using camera parameters obtained via Structure from Motion (SfM). They are then filtered through an algorithm that utilizes Z-buffering along with a normal distribution model of depth and opacities. Through experiments we found that this method is particularly effective for challenging scenarios like car damage detection, where target objects (e.g., scratches, small dents) may only be clearly visible in a single view, making multi-view consistency approaches impractical or impossible. The code is publicly available at: https://github.com/DragosChileban/CrashSplat.
中文: 本研究提出了一种自动化汽车损伤检测流程,利用3D高斯溅射实现单视图3D分割,通过无需学习的新方法有效解决了损伤仅在某单一视角可见的检测难题。
English: This study introduces an automated car damage detection pipeline that leverages 3D Gaussian Splatting for single-view 3D segmentation, effectively addressing scenarios where damage is only visible in one view through a novel learning-free approach.

Authors:Kaisen Yang, Lixuan He, Rushi Shah, Kaicheng Yang, Qinwei Ma, Dianbo Liu, Alex Lamb
Title: Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm
Abstract:
Chain-of-Thought (CoT) and its variants have markedly advanced the reasoning abilities of Large Language Models (LLMs), yet their monolithic and auto-regressive architecture inherently conflates high-level strategic planning with low-level step-by-step execution, leading to computational inefficiency, limited exploration of reasoning paths, and reduced interpretability. To overcome these issues, we propose the Explore-Execute Chain ($E^2C$), a structured reasoning framework that decouples reasoning into two distinct phases: an exploratory phase that stochastically generates succinct high-level plans, followed by an execution phase that deterministically carries out the chosen plan. Our approach incorporates a two-stage training methodology, which combines Supervised Fine-Tuning (SFT) - augmented by a novel data generation algorithm enforcing strict plan adherence - with a subsequent Reinforcement Learning (RL) stage that capitalizes on the informativeness of exploration and reinforces the determinism of execution. This decomposition enables an efficient test-time scaling strategy: on AIME'2024, $E^2C$ Test Time Scaling reaches 58.1% accuracy using <10% of the decoding tokens required by comparable methods (e.g., Forest-of-Thought), sharply cutting self-consistency overhead. For cross-domain adaptation, our Exploration-Focused SFT (EF-SFT) fine-tunes with only 3.5% of the tokens used by standard SFT yet yields up to 14.5% higher accuracy than standard SFT on medical benchmarks, delivering state-of-the-art performance, strong generalization, and greater interpretability by separating planning from execution. The code and pre-trained models for the project are available at: https://github.com/yks23/Explore-Execute-Chain.git
中文:提出的探索-执行链(E²C)框架将推理分解为独立的规划与执行阶段,在比现有方法减少90%以上令牌用量的同时,显著提升了计算效率、准确性和可解释性。
English: The proposed Explore-Execute Chain (E²C) framework decouples reasoning into separate planning and execution phases, significantly improving computational efficiency, accuracy, and interpretability while reducing token usage by over 90% compared to existing methods.

Authors:Hanshi Wang, Yuhao Xu, Zekun Xu, Jin Gao, Yufan Liu, Weiming Hu, Ke Wang, Zhipeng Zhang
Title: AutoPrune: Each Complexity Deserves a Pruning Policy
Abstract:
The established redundancy in visual tokens within large vision-language models allows pruning to effectively reduce their substantial computational demands. Previous methods typically employ heuristic layer-specific pruning strategies where, although the number of tokens removed may differ across decoder layers, the overall pruning schedule is fixed and applied uniformly to all input samples and tasks, failing to align token elimination with the model's holistic reasoning trajectory. Cognitive science indicates that human visual processing often begins with broad exploration to accumulate evidence before narrowing focus as the target becomes distinct. Our experiments reveal an analogous pattern in these models. This observation suggests that neither a fixed pruning schedule nor a heuristic layer-wise strategy can optimally accommodate the diverse complexities inherent in different inputs. To overcome this limitation, we introduce Complexity-Adaptive Pruning (AutoPrune), a training-free, plug-and-play framework that tailors pruning policies to varying sample and task complexities. Specifically, AutoPrune quantifies the mutual information between visual and textual tokens, then projects this signal to a budget-constrained logistic retention curve. Each such logistic curve, defined by its unique shape, corresponds to the specific complexity of different tasks and can guarantee adherence to predefined computational constraints. We evaluate AutoPrune on standard vision-language tasks and on Vision-Language-Action models for autonomous driving. Notably, when applied to LLaVA-1.5-7B, our method prunes 89% of visual tokens and reduces inference FLOPs by 76.8% while retaining 96.7% of the original accuracy averaged over all tasks. This corresponds to a 9.1% improvement over the recent work PDrop, demonstrating the effectiveness. Code is available at https://github.com/AutoLab-SAI-SJTU/AutoPrune.
中文:AutoPrune框架通过量化视觉与文本标记间的互信息,针对不同输入复杂度自适应调整剪枝策略,在保持多任务高精度的同时大幅降低了计算开销。
English: The proposed AutoPrune framework dynamically tailors pruning policies to input complexity by leveraging mutual information between visual and textual tokens, achieving significant computational savings while maintaining high accuracy across vision-language tasks.

Authors:Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao
Title: Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step
Abstract:
Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding are non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between rollout trajectory and optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
中文: 掩码扩散语言模型虽具备并行解码和灵活生成的优势,但在解码策略和强化学习方面存在不足;为此提出的EOS早期拒绝和递增步长解码调度器,以及一致性轨迹组相对策略优化方法,有效提升了推理任务的性能与效率。
English: Masked diffusion language models offer parallel decoding and flexible generation but face challenges with suboptimal decoding strategies and reinforcement learning inconsistencies, which are addressed by new techniques like EOS Early Rejection and Ascending Step-Size decoding scheduler, along with Consistency Trajectory Group Relative Policy Optimization, to enhance performance and efficiency in reasoning tasks.

Authors:Haibao Yu, Wenxian Yang, Ruiyang Hao, Chuanye Wang, Jiaru Zhong, Ping Luo, Zaiqing Nie
Title: DriveE2E: Closed-Loop Benchmark for End-to-End Autonomous Driving through Real-to-Simulation
Abstract:
Closed-loop evaluation is increasingly critical for end-to-end autonomous driving. Current closed-loop benchmarks using the CARLA simulator rely on manually configured traffic scenarios, which can diverge from real-world conditions, limiting their ability to reflect actual driving performance. To address these limitations, we introduce a simple yet challenging closed-loop evaluation framework that closely integrates real-world driving scenarios into the CARLA simulator with infrastructure cooperation. Our approach involves extracting 800 dynamic traffic scenarios selected from a comprehensive 100-hour video dataset captured by high-mounted infrastructure sensors, and creating static digital twin assets for 15 real-world intersections with consistent visual appearance. These digital twins accurately replicate the traffic and environmental characteristics of their real-world counterparts, enabling more realistic simulations in CARLA. This evaluation is challenging due to the diversity of driving behaviors, locations, weather conditions, and times of day at complex urban intersections. In addition, we provide a comprehensive closed-loop benchmark for evaluating end-to-end autonomous driving models. Project URL: \href{https://github.com/AIR-THU/DriveE2E}{https://github.com/AIR-THU/DriveE2E}.
中文: 本文提出了一种创新的闭环评估框架,将800个真实交通场景和15个数字孪生交叉口集成到CARLA模拟器中,为自动驾驶评估创建了更真实的基准测试环境。
English: This paper introduces a novel closed-loop evaluation framework that integrates 800 real-world traffic scenarios and 15 digital twin intersections into CARLA simulator, creating more realistic benchmarks for autonomous driving assessment.

Authors:Xin Luo, Jiahao Wang, Chenyuan Wu, Shitao Xiao, Xiyan Jiang, Defu Lian, Jiajun Zhang, Dong Liu, Zheng liu
Title: EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
Abstract:
Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce EditReward-Bench, a comprehensive benchmark to systematically evaluate reward models on editing quality. Building on this benchmark, we develop EditScore, a series of reward models (7B-72B) for evaluating the quality of instruction-guided image editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of learning proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain.
中文: 本研究提出了用于评估图像编辑奖励模型的基准EditReward-Bench,并开发了专用奖励模型EditScore,该模型能够有效实现指令引导图像编辑的强化学习,从而显著提升编辑性能。
English: This work introduces EditReward-Bench, a benchmark for evaluating reward models in image editing, and develops EditScore, a specialized reward model that enables effective reinforcement learning for instruction-guided image editing, leading to substantial performance improvements.

Authors:You Zhou, Lijiang Chen, Shuchang Lyu, Guangxia Cui, Wenpei Bai, Zheng Zhou, Meng Li, Guangliang Cheng, Huiyu Zhou, Qi Zhao
Title: Adversarial Versus Federated: An Adversarial Learning based Multi-Modality Cross-Domain Federated Medical Segmentation
Abstract:
Federated learning enables collaborative training of machine learning models among different clients while ensuring data privacy, emerging as the mainstream for breaking data silos in the healthcare domain. However, the imbalance of medical resources, data corruption or improper data preservation may lead to a situation where different clients possess medical images of different modality. This heterogeneity poses a significant challenge for cross-domain medical image segmentation within the federated learning framework. To address this challenge, we propose a new Federated Domain Adaptation (FedDA) segmentation training framework. Specifically, we propose a feature-level adversarial learning among clients by aligning feature maps across clients through embedding an adversarial training mechanism. This design can enhance the model's generalization on multiple domains and alleviate the negative impact from domain-shift. Comprehensive experiments on three medical image datasets demonstrate that our proposed FedDA substantially achieves cross-domain federated aggregation, endowing single modality client with cross-modality processing capabilities, and consistently delivers robust performance compared to state-of-the-art federated aggregation algorithms in objective and subjective assessment. Our code are available at https://github.com/GGbond-study/FedDA.
中文:提出的联邦域适应(FedDA)框架通过特征级对抗训练对齐客户端间的特征映射,解决了联邦学习中跨域医学图像分割的挑战,增强了模型泛化能力并减轻了域偏移影响,在多个数据集上的验证显示其性能优于现有方法。
English: The proposed Federated Domain Adaptation (FedDA) framework addresses cross-domain medical image segmentation challenges in federated learning by implementing feature-level adversarial training to align feature maps across clients, enhancing model generalization and mitigating domain-shift effects, as validated by superior performance on multiple datasets.

Authors:Zhixin Zhang, Zeming Wei, Meng Sun
Title: Dynamic Orthogonal Continual Fine-tuning for Mitigating Catastrophic Forgettings
Abstract:
Catastrophic forgetting remains a critical challenge in continual learning for large language models (LLMs), where models struggle to retain performance on historical tasks when fine-tuning on new sequential data without access to past datasets. In this paper, we first reveal that the drift of functional directions during the fine-tuning process is a key reason why existing regularization-based methods fail in long-term LLM continual learning. To address this, we propose Dynamic Orthogonal Continual (DOC) fine-tuning, a novel approach that tracks the drift of these functional directions and dynamically updates them during the fine-tuning process. Furthermore, by adjusting the gradients of new task parameters to be orthogonal to the tracked historical function directions, our method mitigates interference between new and old tasks. Extensive experiments on various LLM continual learning benchmarks demonstrate that this approach outperforms prior methods, effectively reducing catastrophic forgetting and providing a robust tool for continuous LLM fine-tuning. Our code is available at https://github.com/meloxxxxxx/DOC.
中文: 本文提出动态正交持续微调方法,通过追踪功能方向漂移并动态更新,同时调整新任务参数梯度使其与历史功能方向正交,有效缓解大语言模型持续学习中的灾难性遗忘问题,在多个基准测试中表现优异。
English: This paper introduces Dynamic Orthogonal Continual (DOC) fine-tuning, a novel method that addresses catastrophic forgetting in LLMs by tracking and dynamically updating functional direction drifts while enforcing gradient orthogonality between new and historical tasks, achieving superior performance across benchmarks.

Authors:Yukun Chen, Boheng Li, Yu Yuan, Leyi Qi, Yiming Li, Tianwei Zhang, Zhan Qin, Kui Ren
Title: Taught Well Learned Ill: Towards Distillation-conditional Backdoor Attack
Abstract:
Knowledge distillation (KD) is a vital technique for deploying deep neural networks (DNNs) on resource-constrained devices by transferring knowledge from large teacher models to lightweight student models. While teacher models from third-party platforms may undergo security verification (\eg, backdoor detection), we uncover a novel and critical threat: distillation-conditional backdoor attacks (DCBAs). DCBA injects dormant and undetectable backdoors into teacher models, which become activated in student models via the KD process, even with clean distillation datasets. While the direct extension of existing methods is ineffective for DCBA, we implement this attack by formulating it as a bilevel optimization problem and proposing a simple yet effective method (\ie, SCAR). Specifically, the inner optimization simulates the KD process by optimizing a surrogate student model, while the outer optimization leverages outputs from this surrogate to optimize the teacher model for implanting the conditional backdoor. Our SCAR addresses this complex optimization utilizing an implicit differentiation algorithm with a pre-optimized trigger injection function. Extensive experiments across diverse datasets, model architectures, and KD techniques validate the effectiveness of our SCAR and its resistance against existing backdoor detection, highlighting a significant yet previously overlooked vulnerability in the KD process. Our code is available at https://github.com/WhitolfChen/SCAR.
中文: 知识蒸馏技术可将大型教师模型的知识转移到轻量级学生模型,但新发现的蒸馏条件后门攻击(DCBA)能在教师模型中植入潜伏后门,这些后门在蒸馏过程中会被激活到学生模型中,我们提出的SCAR方法通过双层优化成功实现了这种攻击,并在多种实验环境中验证了其有效性。
English: Knowledge distillation enables efficient deployment of deep neural networks on resource-limited devices, but a new threat called distillation-conditional backdoor attacks (DCBAs) can implant dormant backdoors in teacher models that activate in student models during distillation, which our proposed SCAR method effectively implements and demonstrates across various datasets and architectures.

Authors:Tian Nian, Weijie Ke, Yao Mu, Tianxing Chen, Shaolong Zhu, Bingshan Hu
Title: Control Your Robot: A Unified System for Robot Control and Policy Deployment
Abstract:
Cross-platform robot control remains difficult because hardware interfaces, data formats, and control paradigms vary widely, which fragments toolchains and slows deployment. To address this, we present Control Your Robot, a modular, general-purpose framework that unifies data collection and policy deployment across diverse platforms. The system reduces fragmentation through a standardized workflow with modular design, unified APIs, and a closed-loop architecture. It supports flexible robot registration, dual-mode control with teleoperation and trajectory playback, and seamless integration from multimodal data acquisition to inference. Experiments on single-arm and dual-arm systems show efficient, low-latency data collection and effective support for policy learning with imitation learning and vision-language-action models. Policies trained on data gathered by Control Your Robot match expert demonstrations closely, indicating that the framework enables scalable and reproducible robot learning across platforms.
中文:Control Your Robot框架通过模块化设计、统一API和闭环架构解决了跨平台机器人控制的碎片化问题,实现了高效的数据收集与策略部署,其性能可与专家演示相媲美。
English: The Control Your Robot framework addresses cross-platform robot control fragmentation by providing a modular system with unified APIs and a closed-loop architecture, enabling efficient data collection and policy deployment that matches expert performance.

Authors:Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, Dapeng Wu
Title: Tequila: Trapping-free Ternary Quantization for Large Language Models
Abstract:
Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making it not feasible. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as deadzone trapping: a large number of weights are trapped at the deadzone boundary. This occurs because these weights receive only noisy, uninformative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within <1% gap) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim.
中文: Tequila是一种创新的量化方法,通过将死区边界权重重新用作动态偏置来激活它们,从而以最小精度损失和显著加速实现高效的三元大语言模型部署。
English: Tequila is a novel quantization method that reactivates deadzone-trapped weights by converting them into dynamic biases, enabling efficient ternary LLM deployment with minimal accuracy loss and significant speedup.

Authors:Xiaojie Li, Bei Wang, Jianlong Wu, Yue Yu, Liqiang Nie, Min Zhang
Title: GenView++: Unifying Adaptive View Generation and Quality-Driven Supervision for Contrastive Representation Learning
Abstract:
The success of contrastive learning depends on the construction and utilization of high-quality positive pairs. However, current methods face critical limitations on two fronts: on the construction side, both handcrafted and generative augmentations often suffer from limited diversity and risk semantic corruption; on the learning side, the absence of a quality assessment mechanism leads to suboptimal supervision where all pairs are treated equally. To tackle these challenges, we propose GenView++, a unified framework that addresses both fronts by introducing two synergistic innovations. To improve pair construction, GenView++ introduces a multi-source adaptive view generation mechanism to synthesize diverse yet semantically coherent views by dynamically modulating generative parameters across image-conditioned, text-conditioned, and image-text-conditioned strategies. Second, a quality-driven contrastive learning mechanism assesses each pair's semantic alignment and diversity to dynamically reweight their training contribution, prioritizing high-quality pairs while suppressing redundant or misaligned pairs. Extensive experiments demonstrate the effectiveness of GenView++ across both vision and vision-language tasks. For vision representation learning, it improves MoCov2 by +2.5% on ImageNet linear classification. For vision-language learning, it raises the average zero-shot classification accuracy by +12.31% over CLIP and +5.31% over SLIP across ten datasets, and further improves Flickr30k text retrieval R@5 by +3.2%. The code is available at https://github.com/xiaojieli0903/GenViewPlusPlus.
中文: GenView++ 提出了一个统一框架,通过多源自适应视图生成机制创建多样化且语义一致的对比对,并采用质量驱动的对比学习机制动态优化高质量对的训练权重,在视觉和视觉语言任务中取得了显著性能提升。
English: GenView++ introduces a unified framework with multi-source adaptive view generation to create diverse, semantically coherent pairs and a quality-driven contrastive learning mechanism that dynamically prioritizes high-quality pairs, achieving significant performance gains in vision and vision-language tasks.

Authors:Arshia Yousefi Nezhad, Helia Aghaei, Hedieh Sajedi
Title: PVTAdpNet: Polyp Segmentation using Pyramid vision transformer with a novel Adapter block
Abstract:
Colorectal cancer ranks among the most common and deadly cancers, emphasizing the need for effective early detection and treatment. To address the limitations of traditional colonoscopy, including high miss rates due to polyp variability, we introduce the Pyramid Vision Transformer Adapter Residual Network (PVTAdpNet). This model integrates a U-Net-style encoder-decoder structure with a Pyramid Vision Transformer backbone, novel residual blocks, and adapter-based skip connections. The design enhances feature extraction, dense prediction, and gradient flow, supported by squeeze-and-excitation attention for improved channel-wise feature refinement. PVTAdpNet achieves real-time, accurate polyp segmentation, demonstrating superior performance on benchmark datasets with high mDice and mIoU scores, making it highly suitable for clinical applications. PVTAdpNet obtains a high Dice coefficient of 0.8851 and a mean Intersection over Union (mIoU) of 0.8167 on out-of-distribution polyp datasets. Evaluation of the PolypGen dataset demonstrates PVTAdpNet's capability for real-time, accurate performance within familiar distributions. The source code of our network is available at https://github.com/ayousefinejad/PVTAdpNet.git
中文:PVTAdpNet模型通过融合金字塔视觉Transformer与适配器连接,实现了结直肠癌检测中息肉的实时精准分割,在基准数据集上展现出卓越性能。
English: PVTAdpNet, a novel model combining Pyramid Vision Transformer with adapter-based connections, achieves real-time, accurate polyp segmentation for colorectal cancer detection, demonstrating superior performance on benchmark datasets.

Authors:Li Wang, Sudun, Xingjian Zhang, Wenjun Wu, Lei Huang
Title: An Investigation of Batch Normalization in Off-Policy Actor-Critic Algorithms
Abstract:
Batch Normalization (BN) has played a pivotal role in the success of deep learning by improving training stability, mitigating overfitting, and enabling more effective optimization. However, its adoption in deep reinforcement learning (DRL) has been limited due to the inherent non-i.i.d. nature of data and the dynamically shifting distributions induced by the agent's learning process. In this paper, we argue that, despite these challenges, BN retains unique advantages in DRL settings, particularly through its stochasticity and its ability to ease training. When applied appropriately, BN can adapt to evolving data distributions and enhance both convergence speed and final performance. To this end, we conduct a comprehensive empirical study on the use of BN in off-policy actor-critic algorithms, systematically analyzing how different training and evaluation modes impact performance. We further identify failure modes that lead to instability or divergence, analyze their underlying causes, and propose the Mode-Aware Batch Normalization (MA-BN) method with practical actionable recommendations for robust BN integration in DRL pipelines. We also empirically validate that, in RL settings, MA-BN accelerates and stabilizes training, broadens the effective learning rate range, enhances exploration, and reduces overall optimization difficulty. Our code is available at: https://github.com/monster476/ma-bn.git.
中文: 尽管深度强化学习中存在挑战,批量归一化仍具独特优势,而提出的模式感知批量归一化方法提升了训练稳定性和性能。
English: Despite challenges in deep reinforcement learning, Batch Normalization offers unique benefits, and the proposed Mode-Aware Batch Normalization method enhances training stability and performance.

Authors:Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan
Title: Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Abstract:
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.
中文摘要:多模态推理仅在模态提供独立逻辑路径时得以提升,而集成失败(非感知问题)是主要瓶颈;通过结构化评估框架发现任务组合与融合障碍,并提出分步提示与早期融合控制作为解决方向。
English Summary: Multimodal reasoning improves only when modalities provide independent logical paths, with integration failures—not perception—being the primary bottleneck, as revealed through a structured evaluation framework identifying task-composition and fusion issues.

Authors:Yewang Chen, Junfeng Li, Shuyin Xia, Qinghong Lai, Xinbo Gao, Guoyin Wang, Dongdong Cheng, Yi Liu, Yi Wang
Title: GBSK: Skeleton Clustering via Granular-ball Computing and Multi-Sampling for Large-Scale Data
Abstract:
To effectively handle clustering task for large-scale datasets, we propose a novel scalable skeleton clustering algorithm, namely GBSK, which leverages the granular-ball technique to capture the underlying structure of data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical "skeleton" -- a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.
Chinese: GBSK算法采用粒球技术提出了一种可扩展的骨架聚类方法,通过降低计算成本高效处理大规模数据集并保持高精度,其自适应版本AGBSK简化了参数设置以提升实际应用性。
English: The GBSK algorithm introduces a scalable skeleton clustering method using granular-ball technology to efficiently process large-scale datasets by reducing computational costs while maintaining high accuracy, with an adaptive version AGBSK simplifying parameter settings for practical use.

Authors:Xincheng Yao, Chao Shi, Muming Zhao, Guangtao Zhai, Chongyang Zhang
Title: ResAD++: Towards Class Agnostic Anomaly Detection via Residual Feature Learning
Abstract:
This paper explores the problem of class-agnostic anomaly detection (AD), where the objective is to train one class-agnostic AD model that can generalize to detect anomalies in diverse new classes from different domains without any retraining or fine-tuning on the target data. When applied for new classes, the performance of current single- and multi-class AD methods is still unsatisfactory. One fundamental reason is that representation learning in existing methods is still class-related, namely, feature correlation. To address this issue, we propose residual features and construct a simple but effective framework, termed ResAD. Our core insight is to learn the residual feature distribution rather than the initial feature distribution. Residual features are formed by matching and then subtracting normal reference features. In this way, we can effectively realize feature decorrelation. Even in new classes, the distribution of normal residual features would not remarkably shift from the learned distribution. In addition, we think that residual features still have one issue: scale correlation. To this end, we propose a feature hypersphere constraining approach, which learns to constrain initial normal residual features into a spatial hypersphere for enabling the feature scales of different classes as consistent as possible. Furthermore, we propose a novel logbarrier bidirectional contraction OCC loss and vector quantization based feature distribution matching module to enhance ResAD, leading to the improved version of ResAD (ResAD++). Comprehensive experiments on eight real-world AD datasets demonstrate that our ResAD++ can achieve remarkable AD results when directly used in new classes, outperforming state-of-the-art competing methods and also surpassing ResAD. The code is available at https://github.com/xcyao00/ResAD.
中文: 本文提出了ResAD及其改进版ResAD++,通过残差特征和特征超球面约束方法实现类别无关的异常检测,有效解耦特征相关性并在新类别中保持稳定分布,在多个数据集上显著优于现有方法。
English: This paper introduces ResAD and its enhanced version ResAD++, which utilize residual features and a feature hypersphere approach to achieve class-agnostic anomaly detection by decorrelating features and maintaining consistent distributions across new classes, outperforming existing methods on multiple datasets.

Authors:Yinyi Wei, Xiao Li
Title: Text-to-Code Generation for Modular Building Layouts in Building Information Modeling
Abstract:
We present Text2MBL, a text-to-code generation framework that generates executable Building Information Modeling (BIM) code directly from textual descriptions of modular building layout (MBL) design. Unlike conventional layout generation approaches that operate in 2D space, Text2MBL produces fully parametric, semantically rich BIM layouts through on-the-fly code instantiation. To address MBLs' unique challenges due to their hierarchical three-tier structure: modules (physical building blocks), units (self-contained dwellings), and rooms (functional spaces), we developed an object-oriented code architecture and fine-tuned large language models to output structured action sequences in code format. To train and evaluate the framework, we curated a dataset of paired descriptions and ground truth layouts drawn from real-world modular housing projects. Performance was assessed using metrics for executable validity, semantic fidelity, and geometric consistency. By tightly unifying natural language understanding with BIM code generation, Text2MBL establishes a scalable pipeline from high-level conceptual design to automation-ready modular construction workflows. Our implementation is available at https://github.com/CI3LAB/Text2MBL.
中文: Text2MBL框架通过微调大语言模型和面向对象的代码架构,直接从模块化建筑布局的文本描述生成可执行的BIM代码,实现了从概念设计到自动化施工流程的参数化三维布局生成。
English: Text2MBL is a framework that generates executable BIM code from textual descriptions of modular building layouts, using fine-tuned language models and object-oriented architecture to create parametric 3D designs validated through real-world datasets.

Authors:Yunjiang Xu, Lingzhi Li, Jin Wang, Yupeng Ouyang, Benyuan Yang
Title: INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception
Abstract:
Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous works proves that query-based instance-level interaction reduces bandwidth demands and manual priors, however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (INSTance-level INteraCtion ArchiTecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method achieves an improvement in accuracy 13.23%/33.08% in DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods. The code is available at https://github.com/CrazyShout/INSTINCT.
中文:INSTINCT框架通过质量感知过滤和双分支路由的实例级交互机制,在显著提升检测精度的同时将通信带宽降至现有最优方法的1/281至1/264。
English: The proposed INSTINCT framework enhances collaborative perception by employing instance-level interaction with quality-aware filtering and dual-branch routing, achieving significant accuracy improvements and drastic bandwidth reduction compared to state-of-the-art methods.

Authors:Jianshuo Dong, Sheng Guo, Hao Wang, Zhuotao Liu, Tianwei Zhang, Ke Xu, Minlie Huang, Han Qiu
Title: SafeSearch: Automated Red-Teaming for the Safety of LLM-Based Search Agents
Abstract:
Search agents connect LLMs to the Internet, enabling access to broader and more up-to-date information. However, unreliable search results may also pose safety threats to end users, establishing a new threat surface. In this work, we conduct two in-the-wild experiments to demonstrate both the prevalence of low-quality search results and their potential to misguide agent behaviors. To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient, enabling lightweight and harmless safety assessments of search agents. Building on this framework, we construct the SafeSearch benchmark, which includes 300 test cases covering five categories of risks (e.g., misinformation and indirect prompt injection). Using this benchmark, we evaluate three representative search agent scaffolds, covering search workflow, tool-calling, and deep research, across 7 proprietary and 8 open-source backend LLMs. Our results reveal substantial vulnerabilities of LLM-based search agents: when exposed to unreliable websites, the highest ASR reached 90.5% for GPT-4.1-mini under a search workflow setting. Moreover, our analysis highlights the limited effectiveness of common defense practices, such as reminder prompting. This emphasizes the value of our framework in promoting transparency for safer agent development. Our codebase and test cases are publicly available: https://github.com/jianshuod/SafeSearch.
中文摘要:搜索代理使大语言模型能获取网络实时信息,但也因不可靠搜索结果带来安全威胁;本研究通过自动化红队评估框架和SafeSearch基准测试,揭示了现有系统的显著漏洞及常见防御措施的有限效果。
English Summary: Search agents enable LLMs to access current web information but introduce safety risks from unreliable results, prompting the development of an automated red-teaming framework and SafeSearch benchmark that reveal significant vulnerabilities in existing systems and limited effectiveness of common defenses.

Authors:Yifeng He, Luning Yang, Christopher Castro Gaw Gonzalo, Hao Chen
Title: TF-Bench: Evaluating Program Semantics Reasoning with Type Inference in System F
Abstract:
Large Language Models (LLMs) are increasingly integrated into the software engineering ecosystem. Their test-time compute (TTC) reasoning capabilities show significant potential for understanding program logic and semantics beyond mere token recognition. However, current benchmarks for code reasoning lack a formal, program-centric deductive framework to ensure sound evaluation, and are incapable of assessing whether models genuinely reason about program semantics or merely exploit superficial associations between natural language and code tokens. To bridge this gap, we introduce TF-Bench, a benchmark designed to evaluate LLM reasoning based on type inference in System F, a task we refer to as program semantics reasoning. By employing verified transformations to remove semantically irrelevant natural language, we construct TF-Bench_pure, a purely semantics-driven variant of TF-Bench. Our analysis reveals substantial limitations in state-of-the-art LLMs, with the best-performing LLM (Claude-3.7-sonnet) achieving only 55.85% accuracy on TF-Bench_pure. Additionally, we propose two novel metrics to assess robustness and the effectiveness of test-time reasoning, underscoring critical limitations in current LLM capabilities and highlighting essential directions for future research.
中文摘要:大语言模型在代码推理方面潜力显著但缺乏严谨评估框架,为此提出的TF-Bench通过类型推断任务和新型评估指标,揭示了当前模型在程序语义理解上存在明显不足。
English Summary: Large Language Models (LLMs) show promise in code reasoning but lack rigorous evaluation frameworks, leading to the creation of TF-Bench which reveals significant limitations in current models through type inference tasks and novel metrics.

Authors:Danni Yang, Zhikang Chen, Sen Cui, Mengyue Yang, Ding Li, Abudukelimu Wuerkaixi, Haoxuan Li, Jinke Ren, Mingming Gong
Title: Decentralized Dynamic Cooperation of Personalized Models for Federated Continual Learning
Abstract:
Federated continual learning (FCL) has garnered increasing attention for its ability to support distributed computation in environments with evolving data distributions. However, the emergence of new tasks introduces both temporal and cross-client shifts, making catastrophic forgetting a critical challenge. Most existing works aggregate knowledge from clients into a global model, which may not enhance client performance since irrelevant knowledge could introduce interference, especially in heterogeneous scenarios. Additionally, directly applying decentralized approaches to FCL suffers from ineffective group formation caused by task changes. To address these challenges, we propose a decentralized dynamic cooperation framework for FCL, where clients establish dynamic cooperative learning coalitions to balance the acquisition of new knowledge and the retention of prior learning, thereby obtaining personalized models. To maximize model performance, each client engages in selective cooperation, dynamically allying with others who offer meaningful performance gains. This results in non-overlapping, variable coalitions at each stage of the task. Moreover, we use coalitional affinity game to simulate coalition relationships between clients. By assessing both client gradient coherence and model similarity, we quantify the client benefits derived from cooperation. We also propose a merge-blocking algorithm and a dynamic cooperative evolution algorithm to achieve cooperative and dynamic equilibrium. Comprehensive experiments demonstrate the superiority of our method compared to various baselines. Code is available at: https://github.com/ydn3229/DCFCL.
中文: 本文提出了一种去中心化的动态协作联邦持续学习框架,通过基于梯度一致性和模型相似性的选择性合作,使客户端形成自适应联盟以优化个性化模型性能,有效缓解异构环境下的灾难性遗忘问题。
English: This paper introduces a decentralized dynamic cooperation framework for federated continual learning, enabling clients to form adaptive coalitions that enhance personalized model performance by selectively collaborating based on gradient coherence and model similarity, effectively mitigating catastrophic forgetting in heterogeneous environments.

Authors:Weilun Feng, Chuanguang Yang, Haotong Qin, Mingqiang Wu, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Title: QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification
Abstract:
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. Combining them promises compounded efficiency gains, but naive integration is ineffective. The sparsity-induced information loss exacerbates quantization noise, leading to amplified attention shifts. To address this, we propose \textbf{QuantSparse}, a unified framework that integrates model quantization with attention sparsification. Specifically, we introduce \textit{Multi-Scale Salient Attention Distillation}, which leverages both global structural guidance and local salient supervision to mitigate quantization-induced bias. In addition, we develop \textit{Second-Order Sparse Attention Reparameterization}, which exploits the temporal stability of second-order residuals to efficiently recover information lost under sparsity. Experiments on HunyuanVideo-13B demonstrate that QuantSparse achieves 20.88 PSNR, substantially outperforming the state-of-the-art quantization baseline Q-VDiT (16.85 PSNR), while simultaneously delivering a \textbf{3.68$\times$} reduction in storage and \textbf{1.88$\times$} acceleration in end-to-end inference. Our code will be released in https://github.com/wlfeng0509/QuantSparse.
中文摘要:QuantSparse是一个统一压缩框架,通过多尺度注意力蒸馏和二阶稀疏重参数化技术,有效结合模型量化和注意力稀疏化,在显著降低存储和加速推理的同时实现了更优的视频生成质量。
English Summary: QuantSparse is a unified compression framework that effectively combines model quantization and attention sparsification through multi-scale attention distillation and second-order sparse reparameterization, achieving superior video generation quality with significantly reduced storage and accelerated inference.

Authors:Dayu Tan, Ziwei Zhang, Yansan Su, Xin Peng, Yike Dai, Chunhou Zheng, Weimin Zhong
Title: MSD-KMamba: Bidirectional Spatial-Aware Multi-Modal 3D Brain Segmentation via Multi-scale Self-Distilled Fusion Strategy
Abstract:
Numerous CNN-Transformer hybrid models rely on high-complexity global attention mechanisms to capture long-range dependencies, which introduces non-linear computational complexity and leads to significant resource consumption. Although knowledge distillation and sparse attention mechanisms can improve efficiency, they often fall short of delivering the high segmentation accuracy necessary for complex tasks. Balancing model performance with computational efficiency remains a critical challenge. In this work, we propose a novel 3D multi-modal image segmentation framework, termed MSD-KMamba, which integrates bidirectional spatial perception with multi-scale self-distillation. The bidirectional spatial aware branch effectively captures long-range spatial context dependencies across brain regions, while also incorporating a powerful nonlinear feature extraction mechanism that further enhances the model's ability to learn complex and heterogeneous patterns. In addition, the proposed multi-scale self-distilled fusion strategy strengthens hierarchical feature representations and improves the transfer of semantic information at different resolution levels. By jointly leveraging the bidirectional spatial perception branch and the multi-scale self-distilled fusion strategy, our framework effectively mitigates the bottleneck of quadratic computational complexity in volumetric segmentation, while simultaneously addressing the limitation of insufficient global perception. Extensive experiments on multiple standard benchmark datasets demonstrate that MSD-KMamba consistently outperforms state-of-the-art methods in segmentation accuracy, robustness, and generalization, while maintaining high computational efficiency and favorable scalability. The source code of MSD-KMamba is publicly available at https://github.com/daimao-zhang/MSD-KMamba.
中文: MSD-KMamba框架通过双向空间感知和多尺度自蒸馏技术,在三维多模态图像分割中有效捕获长程依赖并增强特征表示,相比现有方法实现了更高的分割精度和计算效率。
English: The MSD-KMamba framework introduces bidirectional spatial perception and multi-scale self-distillation to effectively capture long-range dependencies and enhance feature representation in 3D multi-modal image segmentation, achieving superior accuracy and efficiency compared to existing methods.

Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Title: Beyond Greedy Exits: Improved Early Exit Decisions for Risk Control and Reliability
Abstract:
Early-Exit Deep Neural Networks enable adaptive inference by allowing prediction at intermediary layers, significantly reducing computational costs and latency. Most of the early exit strategies greedily exit a sample at an intermediary layer if the confidence in class prediction exceeds a predefined threshold that is set using a static validation set. This is problematic as the model might be overconfident in a wrong class. Also, they are not robust to distribution shifts encountered in deployment, which can undermine model trustworthiness and accuracy. To address these challenges, we propose UAT that adapts the threshold for exit decisions using a Multi-Armed Bandit framework, enabling online, unsupervised adjustment of exit decisions. UAT makes decisions based on a new reward function that assesses predictive certainty and its reliability to balance computational efficiency and prediction quality while penalizing unnecessary late exits. We provide guarantees on risk achieved by UAT and validate its performance on diverse tasks spanning vision-language understanding, text generation, and classification. Our framework demonstrates consistent improvements in speedup (1.70-2.10x) with a minimal performance drop (<2%) as compared to full model performance. Our source code is available at https://github.com/Div290/UAT.
中文摘要:提出的UAT框架通过多臂老虎机方法自适应调整退出阈值,解决了早期退出深度神经网络中的过度自信和分布偏移问题,在实现显著加速(1.70-2.10倍)的同时保持性能损失最小(<2%)。
English Summary: The proposed UAT framework adaptively adjusts exit thresholds using a Multi-Armed Bandit approach to address overconfidence and distribution shift issues in Early-Exit DNNs, achieving significant speedup (1.70-2.10x) with minimal performance loss (<2%).

Authors:Kristina P. Sinaga, Arjun S. Nair
Title: Calibration Meets Reality: Making Machine Learning Predictions Trustworthy
Abstract:
Post-hoc calibration methods are widely used to improve the reliability of probabilistic predictions from machine learning models. Despite their prevalence, a comprehensive theoretical understanding of these methods remains elusive, particularly regarding their performance across different datasets and model architectures. Input features play a crucial role in shaping model predictions and, consequently, their calibration. However, the interplay between feature quality and calibration performance has not been thoroughly investigated. In this work, we present a rigorous theoretical analysis of post-hoc calibration methods, focusing on Platt scaling and isotonic regression. We derive convergence guarantees, computational complexity bounds, and finite-sample performance metrics for these methods. Furthermore, we explore the impact of feature informativeness on calibration performance through controlled synthetic experiments. Our empirical evaluation spans a diverse set of real-world datasets and model architectures, demonstrating consistent improvements in calibration metrics across various scenarios. By examining calibration performance under varying feature conditions utilizing only informative features versus complete feature spaces including noise dimensions, we provide fundamental insights into the robustness and reliability of different calibration approaches. Our findings offer practical guidelines for selecting appropriate calibration methods based on dataset characteristics and computational constraints, bridging the gap between theoretical understanding and practical implementation in uncertainty quantification. Code and experimental data are available at: https://github.com/Ajwebdevs/calibration-analysis-experiments.
中文摘要:本研究对后验校准方法进行了系统的理论与实证分析,揭示了特征质量对校准性能的影响机制,并基于数据集特性提出了实用的校准方法选择指南。
English Summary: This study provides a comprehensive theoretical and empirical analysis of post-hoc calibration methods, revealing how feature quality impacts calibration performance and offering practical guidelines for method selection based on dataset characteristics.

Authors:Fanlong Zeng, Wensheng Gan, Jiayang Wu, Philip S. Yu
Title: Pure Node Selection for Imbalanced Graph Node Classification
Abstract:
The problem of class imbalance refers to an uneven distribution of quantity among classes in a dataset, where some classes are significantly underrepresented compared to others. Class imbalance is also prevalent in graph-structured data. Graph neural networks (GNNs) are typically based on the assumption of class balance, often overlooking the issue of class imbalance. In our investigation, we identified a problem, which we term the Randomness Anomalous Connectivity Problem (RACP), where certain off-the-shelf models are affected by random seeds, leading to a significant performance degradation. To eliminate the influence of random factors in algorithms, we proposed PNS (Pure Node Sampling) to address the RACP in the node synthesis stage. Unlike existing approaches that design specialized algorithms to handle either quantity imbalance or topological imbalance, PNS is a novel plug-and-play module that operates directly during node synthesis to mitigate RACP. Moreover, PNS also alleviates performance degradation caused by abnormal distribution of node neighbors. We conduct a series of experiments to identify what factors are influenced by random seeds. Experimental results demonstrate the effectiveness and stability of our method, which not only eliminates the effect of unfavorable random seeds but also outperforms the baseline across various benchmark datasets with different GNN backbones. Data and code are available at https://github.com/flzeng1/PNS.
Chinese: 本研究提出了纯节点采样(PNS)模块,作为一种即插即用的解决方案,在节点合成阶段有效应对图数据中的随机性异常连接问题,消除了随机种子导致的性能下降。
English: The study introduces Pure Node Sampling (PNS), a plug-and-play module that addresses the Randomness Anomalous Connectivity Problem in class-imbalanced graph data by mitigating performance degradation caused by random seeds during node synthesis.

Authors:Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Kangli Zi, Qingming Huang
Title: LightFair: Towards an Efficient Alternative for Fair T2I Diffusion via Debiasing Pre-trained Text Encoders
Abstract:
This paper explores a novel lightweight approach LightFair to achieve fair text-to-image diffusion models (T2I DMs) by addressing the adverse effects of the text encoder. Most existing methods either couple different parts of the diffusion model for full-parameter training or rely on auxiliary networks for correction. They incur heavy training or sampling burden and unsatisfactory performance. Since T2I DMs consist of multiple components, with the text encoder being the most fine-tunable and front-end module, this paper focuses on mitigating bias by fine-tuning text embeddings. To validate feasibility, we observe that the text encoder's neutral embedding output shows substantial skewness across image embeddings of various attributes in the CLIP space. More importantly, the noise prediction network further amplifies this imbalance. To finetune the text embedding, we propose a collaborative distance-constrained debiasing strategy that balances embedding distances to improve fairness without auxiliary references. However, mitigating bias can compromise the original generation quality. To address this, we introduce a two-stage text-guided sampling strategy to limit when the debiased text encoder intervenes. Extensive experiments demonstrate that LightFair is effective and efficient. Notably, on Stable Diffusion v1.5, our method achieves SOTA debiasing at just $1/4$ of the training burden, with virtually no increase in sampling burden. The code is available at https://github.com/boyuh/LightFair.
中文摘要:本文提出LightFair轻量方法,通过距离约束去偏策略微调文本嵌入并结合两阶段采样,有效提升文本到图像扩散模型的公平性,以仅四分之一训练负担实现最优去偏效果且几乎不增加采样成本。
English Summary: This paper introduces LightFair, a lightweight method that enhances fairness in text-to-image diffusion models by fine-tuning text embeddings with a distance-constrained debiasing strategy and a two-stage sampling approach, achieving state-of-the-art performance with significantly reduced training and sampling overhead.

Authors:Cheng Huang, Weizheng Xie, Fan Gao, Yutong Liu, Ruoling Wu, Zeyu Han, Jingxi Qiu, Xiangxiang Wang, Zhenglin Yang, Hao Wang, Yongbin Yu
Title: BioVessel-Net and RetinaMix: Unsupervised Retinal Vessel Segmentation from OCTA Images
Abstract:
Structural changes in retinal blood vessels are critical biomarkers for the onset and progression of glaucoma and other ocular diseases. However, current vessel segmentation approaches largely rely on supervised learning and extensive manual annotations, which are costly, error-prone, and difficult to obtain in optical coherence tomography angiography. Here we present BioVessel-Net, an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement and a radius-guided segmentation strategy. Unlike pixel-based methods, BioVessel-Net directly models vascular structures with biostatistical coherence, achieving accurate and explainable vessel extraction without labeled data or high-performance computing. To support training and evaluation, we introduce RetinaMix, a new benchmark dataset of 2D and 3D OCTA images with high-resolution vessel details from diverse populations. Experimental results demonstrate that BioVessel-Net achieves near-perfect segmentation accuracy across RetinaMix and existing datasets, substantially outperforming state-of-the-art supervised and semi-supervised methods. Together, BioVessel-Net and RetinaMix provide a label-free, computationally efficient, and clinically interpretable solution for retinal vessel analysis, with broad potential for glaucoma monitoring, blood flow modeling, and progression prediction. Code and dataset are available: https://github.com/VikiXie/SatMar8.
中文摘要:BioVessel-Net是一个无监督生成框架,通过整合血管生物统计学与对抗性优化,无需人工标注即可实现精准的视网膜血管分割,其性能超越现有监督方法且具备临床可解释性。
English Summary: BioVessel-Net is an unsupervised generative framework that integrates vessel biostatistics with adversarial refinement to achieve accurate retinal vessel segmentation without manual annotations, outperforming existing supervised methods while providing clinical interpretability.

Authors:Fanlong Zeng, Wensheng Gan, Philip S. Yu
Title: GraphIFE: Rethinking Graph Imbalance Node Classification via Invariant Learning
Abstract:
The class imbalance problem refers to the disproportionate distribution of samples across different classes within a dataset, where the minority classes are significantly underrepresented. This issue is also prevalent in graph-structured data. Most graph neural networks (GNNs) implicitly assume a balanced class distribution and therefore often fail to account for the challenges introduced by class imbalance, which can lead to biased learning and degraded performance on minority classes. We identify a quality inconsistency problem in synthesized nodes, which leads to suboptimal performance under graph imbalance conditions. To mitigate this issue, we propose GraphIFE (Graph Invariant Feature Extraction), a novel framework designed to mitigate quality inconsistency in synthesized nodes. Our approach incorporates two key concepts from graph invariant learning and introduces strategies to strengthen the embedding space representation, thereby enhancing the model's ability to identify invariant features. Extensive experiments demonstrate the framework's efficiency and robust generalization, as GraphIFE consistently outperforms various baselines across multiple datasets. The code is publicly available at https://github.com/flzeng1/GraphIFE.
Chinese Summary: 本文提出GraphIFE框架,通过图不变特征学习和增强嵌入表示来解决图数据中类别不平衡问题,有效缓解合成节点质量不一致性并提升模型性能。
English Summary: The paper introduces GraphIFE, a novel framework that addresses class imbalance in graph data by mitigating quality inconsistency in synthesized nodes through invariant feature extraction and enhanced embedding strategies.

Authors:Han Hu, Zhuoran Zheng, Liang Li, Chen Lyu
Title: VAMamba: An Efficient Visual Adaptive Mamba for Image Restoration
Abstract:
Recent Mamba-based image restoration methods have achieved promising results but remain limited by fixed scanning patterns and inefficient feature utilization. Conventional Mamba architectures rely on predetermined paths that cannot adapt to diverse degradations, constraining both restoration performance and computational efficiency. To overcome these limitations, we propose VAMamba, a Visual Adaptive Mamba framework with two key innovations. First, QCLAM(Queue-basedCacheLow-rankAdaptiveMemory)enhancesfeaturelearningthrougha FIFO cache that stores historical representations. Similarity between current LoRA-adapted and cached features guides intelligent fusion, enabling dynamic reuse while effectively controlling memorygrowth.Second, GPS-SS2D(GreedyPathScanSS2D)introducesadaptive scanning. A Vision Transformer generates score maps to estimate pixel importance, and a greedy strategy de termines optimal forward and backward scanning paths. These learned trajectories replace rigid patterns, enabling SS2D to perform targeted feature extraction. The integration of QCLAM and GPS-SS2D allows VAMamba to adaptively focus on degraded regions while maintaining high computational efficiency. Extensive experiments across diverse restoration tasks demonstrate that VAMamba consistently outperforms existing approaches in both restoration quality and efficiency, establishing new benchmarks for adaptive image restoration. Our code is available at https://github.com/WaterHQH/VAMamba.
中文摘要:VAMamba通过QCLAM实现动态特征复用和GPS-SS2D自适应扫描路径,在多种图像修复任务中实现了卓越的修复质量与计算效率。
English Summary: VAMamba introduces QCLAM for dynamic feature reuse and GPS-SS2D for adaptive scanning paths, achieving superior image restoration quality and efficiency across diverse tasks.

Authors:Kaicheng Yang, Xun Zhang, Haotong Qin, Yucheng Lin, Kaisen Yang, Xianglong Yan, Yulun Zhang
Title: RobuQ: Pushing DiTs to W1.58A2 via Robust Activation Quantization
Abstract:
Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for image generation, demonstrating superior scalability and performance over U-Net architectures. However, their practical deployment is hindered by substantial computational and memory costs. While Quantization-Aware Training (QAT) has shown promise for U-Nets, its application to DiTs faces unique challenges, primarily due to the sensitivity and distributional complexity of activations. In this work, we identify activation quantization as the primary bottleneck for pushing DiTs to extremely low-bit settings. To address this, we propose a systematic QAT framework for DiTs, named RobuQ. We start by establishing a strong ternary weight (W1.58A4) DiT baseline. Building upon this, we propose RobustQuantizer to achieve robust activation quantization. Our theoretical analyses show that the Hadamard transform can convert unknown per-token distributions into per-token normal distributions, providing a strong foundation for this method. Furthermore, we propose AMPN, the first Activation-only Mixed-Precision Network pipeline for DiTs. This method applies ternary weights across the entire network while allocating different activation precisions to each layer to eliminate information bottlenecks. Through extensive experiments on unconditional and conditional image generation, our RobuQ framework achieves state-of-the-art performance for DiT quantization in sub-4-bit quantization configuration. To the best of our knowledge, RobuQ is the first achieving stable and competitive image generation on large datasets like ImageNet-1K with activations quantized to average 2 bits. The code and models will be available at https://github.com/racoonykc/RobuQ .
中文摘要:RobuQ框架通过引入RobustQuantizer和AMPN方法,解决了扩散变换器中激活量化的关键难题,在低于4比特的量化配置下实现了最先进的图像生成性能。
English Summary: The RobuQ framework addresses activation quantization challenges in Diffusion Transformers (DiTs) by introducing RobustQuantizer and AMPN pipeline, achieving state-of-the-art sub-4-bit quantization performance with competitive image generation quality.

Authors:Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Yang Xiang, Buzhou Tang
Title: Towards Efficient CoT Distillation: Self-Guided Rationale Selector for Better Performance with Fewer Rationales
Abstract:
Chain-of-thought (CoT) distillation aims to enhance small language models' (SLMs) reasoning by transferring multi-step reasoning capability from the larger teacher models. However, existing work underestimates rationale quality, focusing primarily on data quantity, which may transfer noisy or incorrect information to the student model. To address the above issues, we proposed \textbf{M}odel-\textbf{O}riented \textbf{R}ationale \textbf{S}election \textbf{D}istillation (MoRSD), which can discern and select high quality rationales for distillation to improve performance further. We further propose a Rationale Difficulty (RD) metric to measure the ability of the student model to generate the correct answer under a given rationale. Compared to the baseline, we achieved 4.6$\%$ average improvement on seven datasets over three tasks, using fewer rationales by controlling their accuracy, diversity, and difficulty. Our results reveal that a small portion of the high quality rationales can enhance the reasoning ability of student models than the entire dataset. Our method promises to be a possible solution for efficient CoT distillation. Our code will be released in https://github.com/Leon221220/MoRSD.
中文:MoRSD通过基于准确性、多样性和难度选择高质量推理链,显著提升了思维链蒸馏的效果,仅用少量训练样本就实现了性能的大幅提升。
English: MoRSD enhances chain-of-thought distillation by selecting high-quality rationales based on accuracy, diversity, and difficulty, achieving significant performance improvements with fewer training examples.

Authors:Min-Hsuan Yeh, Yixuan Li
Title: Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment
Abstract:
Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality-highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.
中文: 本文提出了首个系统性评估13种偏好数据清洗方法的大模型对齐基准PrefCleanBench,揭示了数据清洗成功的关键因素,并强调了数据预处理在负责任AI开发中的重要作用。
English: This paper introduces PrefCleanBench, the first comprehensive benchmark to systematically evaluate 13 preference data cleaning methods for improving large language model alignment, revealing key factors for success and emphasizing data preprocessing's critical role in responsible AI development.

Authors:Hamidreza Rouzegar, Masoud Makrehchi
Title: The Impact of Role Design in In-Context Learning for Large Language Models
Abstract:
In-context learning (ICL) enables Large Language Models (LLMs) to generate predictions based on prompts without additional fine-tuning. While prompt engineering has been widely studied, the impact of role design within prompts remains underexplored. This study examines the influence of role configurations in zero-shot and few-shot learning scenarios using GPT-3.5 and GPT-4o from OpenAI and Llama2-7b and Llama2-13b from Meta. We evaluate the models' performance across datasets, focusing on tasks like sentiment analysis, text classification, question answering, and math reasoning. Our findings suggest the potential of role-based prompt structuring to enhance LLM performance.
中文: 本研究探讨了提示中角色配置对大型语言模型在零样本和少样本学习中表现的影响,发现基于角色的结构设计能够提升模型在多种任务中的性能。
English: This study explores how role configurations in prompts affect the performance of large language models in zero-shot and few-shot learning, revealing that role-based structuring can enhance their effectiveness across various tasks.

Authors:Jie Yang, Yifan Hu, Kexin Zhang, Luyang Niu, Yushun Dong, Philip S. Yu, Kaize Ding
Title: Revisiting Multivariate Time Series Forecasting with Missing Values
Abstract:
Missing values are common in real-world time series, and multivariate time series forecasting with missing values (MTSF-M) has become a crucial area of research for ensuring reliable predictions. To address the challenge of missing data, current approaches have developed an imputation-then-prediction framework that uses imputation modules to fill in missing values, followed by forecasting on the imputed data. However, this framework overlooks a critical issue: there is no ground truth for the missing values, making the imputation process susceptible to errors that can degrade prediction accuracy. In this paper, we conduct a systematic empirical study and reveal that imputation without direct supervision can corrupt the underlying data distribution and actively degrade prediction accuracy. To address this, we propose a paradigm shift that moves away from imputation and directly predicts from the partially observed time series. We introduce Consistency-Regularized Information Bottleneck (CRIB), a novel framework built on the Information Bottleneck principle. CRIB combines a unified-variate attention mechanism with a consistency regularization scheme to learn robust representations that filter out noise introduced by missing values while preserving essential predictive signals. Comprehensive experiments on four real-world datasets demonstrate the effectiveness of CRIB, which predicts accurately even under high missing rates. Our code is available in https://github.com/Muyiiiii/CRIB.
中文摘要:本文提出CRIB框架,无需填补缺失值即可直接从不完整时间序列进行预测,避免了填补误差导致的精度下降,实验证明其在高缺失率下仍能保持准确预测。
English Summary: The paper introduces the CRIB framework, which directly forecasts from incomplete time series without imputation to prevent accuracy degradation caused by imputation errors, demonstrating superior performance even with high missing data rates.

Authors:Junyi Wu, Jiachen Tao, Haoxuan Wang, Gaowen Liu, Ramana Rao Kompella, Yan Yan
Title: Orientation-anchored Hyper-Gaussian for 4D Reconstruction from Casual Videos
Abstract:
We present Orientation-anchored Gaussian Splatting (OriGS), a novel framework for high-quality 4D reconstruction from casually captured monocular videos. While recent advances extend 3D Gaussian Splatting to dynamic scenes via various motion anchors, such as graph nodes or spline control points, they often rely on low-rank assumptions and fall short in modeling complex, region-specific deformations inherent to unconstrained dynamics. OriGS addresses this by introducing a hyperdimensional representation grounded in scene orientation. We first estimate a Global Orientation Field that propagates principal forward directions across space and time, serving as stable structural guidance for dynamic modeling. Built upon this, we propose Orientation-aware Hyper-Gaussian, a unified formulation that embeds time, space, geometry, and orientation into a coherent probabilistic state. This enables inferring region-specific deformation through principled conditioned slicing, adaptively capturing diverse local dynamics in alignment with global motion intent. Experiments demonstrate the superior reconstruction fidelity of OriGS over mainstream methods in challenging real-world dynamic scenes.
中文: OriGS通过基于场景方向的高维表示,解决了单目视频4D重建中复杂区域特定形变的建模难题,在动态场景中实现了卓越的还原精度。
English: OriGS introduces a hyperdimensional representation based on scene orientation to model complex, region-specific deformations in 4D reconstruction from monocular videos, achieving superior fidelity in dynamic scenes.

Authors:Jiang-Xin Shi, Wen-Da Wei, Jin-Fei Qi, Xuanyu Chen, Tong Wei, Yu-Feng Li
Title: Memory-Efficient Fine-Tuning via Low-Rank Activation Compression
Abstract:
The parameter-efficient fine-tuning paradigm has garnered significant attention with the advancement of foundation models. Although numerous methods have been proposed to reduce the number of trainable parameters, their substantial memory overhead remains a critical bottleneck that hinders practical deployment. In this paper, we observe that model activations constitute a major source of memory consumption, especially under large batch sizes and long context lengths; however, the rank of the activations remains consistently low. Motivated by this insight, we propose a memory-efficient fine-tuning approach Low-Rank Activation Compression (LoRAct). Unlike prior work, LoRAct provides a more flexible and versatile compressing strategy that can be applied online during the forward pass without the need for any calibration data. Moreover, LoRAct incorporates a novel sampling-based orthogonal decomposition algorithm specifically designed for low-rank matrices, offering improved computational efficiency and a tighter error bound compared to the widely used RSVD. Experiments on both vision and language tasks demonstrate the effectiveness of LoRAct. Notably, LoRAct further reduces activation memory by approximately 80% in comparison with the widely adopted LoRA method, while maintaining competitive performance. The source code is available at https://github.com/shijxcs/meft.
中文: 本文提出LoRAct这一内存高效微调方法,无需校准数据即可在线压缩低秩激活,相比LoRA能减少约80%的激活内存占用,同时保持性能竞争力。
English: The paper introduces LoRAct, a memory-efficient fine-tuning method that compresses low-rank activations online without calibration data, reducing activation memory by about 80% compared to LoRA while maintaining performance.

Authors:Mohammad Hossein Sameti, Amir M. Mansourian, Arash Marioriyad, Soheil Fadaee Oshyani, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Title: No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation
Abstract:
Despite recent advances in text-to-image (T2I) models, they often fail to faithfully render all elements of complex prompts, frequently omitting or misrepresenting specific objects and attributes. Test-time optimization has emerged as a promising approach to address this limitation by refining generation without the need for retraining. In this paper, we propose a fine-grained test-time optimization framework that enhances compositional faithfulness in T2I generation. Unlike most of prior approaches that rely solely on a global image/text similarity score, our method decomposes the input prompt into semantic concepts and evaluates alignment at both the global and concept levels. A fine-grained variant of CLIP is used to compute concept-level correspondence, producing detailed feedback on missing or inaccurate concepts. This feedback is fed into an iterative prompt refinement loop, enabling the large language model to propose improved prompts. Experiments on DrawBench and CompBench prompts demonstrate that our method significantly improves concept coverage and human-judged faithfulness over both standard test-time optimization and the base T2I model. Code is available at: https://github.com/AmirMansurian/NoConceptLeftBehind
中文: 本文提出了一种细粒度的测试时优化框架,通过将提示分解为语义概念并利用迭代反馈来提高概念覆盖率和生成忠实度,在文本到图像生成任务中显著优于现有方法。
English: This paper introduces a fine-grained test-time optimization framework that enhances text-to-image generation by decomposing prompts into semantic concepts and using iterative feedback to improve concept coverage and faithfulness, outperforming existing methods.

Authors:Tharindu Ekanayake, Constantino Álvarez Casado, Miguel Bordallo López
Title: 3DPCNet: Pose Canonicalization for Robust Viewpoint-Invariant 3D Kinematic Analysis from Monocular RGB cameras
Abstract:
Monocular 3D pose estimators produce camera-centered skeletons, creating view-dependent kinematic signals that complicate comparative analysis in applications such as health and sports science. We present 3DPCNet, a compact, estimator-agnostic module that operates directly on 3D joint coordinates to rectify any input pose into a consistent, body-centered canonical frame. Its hybrid encoder fuses local skeletal features from a graph convolutional network with global context from a transformer via a gated cross-attention mechanism. From this representation, the model predicts a continuous 6D rotation that is mapped to an $SO(3)$ matrix to align the pose. We train the model in a self-supervised manner on the MM-Fi dataset using synthetically rotated poses, guided by a composite loss ensuring both accurate rotation and pose reconstruction. On the MM-Fi benchmark, 3DPCNet reduces the mean rotation error from over 20$^{\circ}$ to 3.4$^{\circ}$ and the Mean Per Joint Position Error from ~64 mm to 47 mm compared to a geometric baseline. Qualitative evaluations on the TotalCapture dataset further demonstrate that our method produces acceleration signals from video that show strong visual correspondence to ground-truth IMU sensor data, confirming that our module removes viewpoint variability to enable physically plausible motion analysis.
中文:3DPCNet是一个紧凑型模块,通过自监督旋转对齐将视角依赖的3D姿态转换为统一的身体坐标系表示,将旋转误差从超过20°降至3.4°,位置误差降至47毫米,显著提升了运动分析的准确性。
English: 3DPCNet is a compact module that converts view-dependent 3D poses into consistent body-centered representations through self-supervised rotation alignment, significantly improving motion analysis accuracy by reducing rotation errors from over 20° to 3.4° and position errors to 47 mm.

Authors:Sahithya Ravi, Aditya Chinchure, Raymond T. Ng, Leonid Sigal, Vered Shwartz
Title: SPIKE-RL: Video-LLMs meet Bayesian Surprise
Abstract:
Real-world videos often show routine activities punctuated by memorable, surprising events. However, most Video-LLMs process videos by sampling frames uniformly, likely missing critical moments that define a video's narrative. We introduce SPIKE, an inference-time framework that quantifies Bayesian Surprise as the belief update triggered by new visual evidence in the video stream, identifying moments where new visual evidence conflicts with prior beliefs. SPIKE effectively localizes surprise in videos, strongly correlated with humans on positive (FunQA) and negative (Oops!) surprise benchmarks. Since the beliefs of zero-shot Video-LLMs are often suboptimal, we develop SPIKE-RL, which leverages GRPO to optimize belief hypotheses based on a reward signal from the video caption. SPIKE and SPIKE-RL guide query-agnostic surprise-weighted frame sampling, which allocates more frames to interesting moments in the video. With this strategy, we achieve consistent performance gains on five downstream benchmarks over uniform sampling. By enabling Video-LLMs to track beliefs and register surprise, our work paves the way for more robust models that can revise their understanding in response to new information.
中文摘要:SPIKE是一种推理时框架,通过量化贝叶斯惊喜来定位视频中的意外时刻,其引导的帧采样策略能在五个下游基准测试中持续超越均匀采样性能,使视频大语言模型能更好地捕捉关键叙事节点。
English Summary: SPIKE is an inference-time framework that identifies surprising moments in videos by measuring Bayesian Surprise, enabling optimized frame sampling that improves performance across multiple benchmarks by focusing on key narrative events.

Authors:Rajaa El Hamdani, Samy Haffoudhi, Nils Holzenberger, Fabian Suchanek, Thomas Bonald, Fragkiskos D. Malliaros
Title: Retrieval-Constrained Decoding Reveals Underestimated Parametric Knowledge in Language Models
Abstract:
Language models (LMs) encode substantial factual knowledge, but often produce answers judged as incorrect. We hypothesize that many of these answers are actually correct, but are expressed in alternative surface forms that are dismissed due to an overly strict evaluation, leading to an underestimation of models' parametric knowledge. We propose Retrieval-Constrained Decoding (RCD), a decoding strategy that restricts model outputs to unique surface forms. We introduce YAGO-QA, a dataset of 19,137 general knowledge questions. Evaluating open-source LMs from 135M to 70B parameters, we show that standard decoding undervalues their knowledge. For instance, Llama-3.1-70B scores only 32.3% F1 with vanilla decoding but 46.0% with RCD. Similarly, Llama-3.1-8B reaches 33.0% with RCD, outperforming the larger model under vanilla decoding. We publicly share the code and dataset at https://github.com/Rajjaa/disambiguated-LLM.
Chinese: 语言模型常因严格评估而被忽视其替代形式的正确答案,但采用检索约束解码策略可显著提升性能,这在YAGO-QA数据集上得到了验证。
English: Language models often produce correct answers in alternative forms that are dismissed by strict evaluations, but using Retrieval-Constrained Decoding significantly improves their performance, as demonstrated on the YAGO-QA dataset.

Authors:Wenhang Shi, Yiren Chen, Shuqing Bian, Xinyi Zhang, Kai Tang, Pengfei Hu, Zhe Zhao, Wei Lu, Xiaoyong Du
Title: No Loss, No Gain: Gated Refinement and Adaptive Compression for Prompt Optimization
Abstract:
Prompt engineering is crucial for leveraging the full potential of large language models (LLMs). While automatic prompt optimization offers a scalable alternative to costly manual design, generating effective prompts remains challenging. Existing methods often struggle to stably generate improved prompts, leading to low efficiency, and overlook that prompt optimization easily gets trapped in local optima. Addressing this, we propose GRACE, a framework that integrates two synergistic strategies: Gated Refinement and Adaptive Compression, achieving Efficient prompt optimization. The gated refinement strategy introduces a feedback regulation gate and an update rejection gate, which refine update signals to produce stable and effective prompt improvements. When optimization stagnates, the adaptive compression strategy distills the prompt's core concepts, restructuring the optimization trace and opening new paths. By strategically introducing information loss through refinement and compression, GRACE delivers substantial gains in performance and efficiency. In extensive experiments on 11 tasks across three practical domains, including BIG-Bench Hard (BBH), domain-specific, and general NLP tasks, GRACE achieves significant average relative performance improvements of 4.7%, 4.4% and 2.7% over state-of-the-art methods, respectively. Further analysis shows that GRACE achieves these gains using only 25% of the prompt generation budget required by prior methods, highlighting its high optimization efficiency and low computational overhead. Our code is available at https://github.com/Eric8932/GRACE.
Chinese: GRACE框架通过门控优化和自适应压缩策略解决提示优化中的局部最优问题,在仅需25%计算预算的情况下实现了显著性能提升。
English: The GRACE framework introduces gated refinement and adaptive compression strategies to overcome local optima in prompt optimization, achieving significant performance gains with only 25% of the computational budget compared to existing methods.

Authors:Xi Ding, Lei Wang, Piotr Koniusz, Yongsheng Gao
Title: Graph Your Own Prompt
Abstract:
We propose Graph Consistency Regularization (GCR), a novel framework that injects relational graph structures, derived from model predictions, into the learning process to promote class-aware, semantically meaningful feature representations. Functioning as a form of self-prompting, GCR enables the model to refine its internal structure using its own outputs. While deep networks learn rich representations, these often capture noisy inter-class similarities that contradict the model's predicted semantics. GCR addresses this issue by introducing parameter-free Graph Consistency Layers (GCLs) at arbitrary depths. Each GCL builds a batch-level feature similarity graph and aligns it with a global, class-aware masked prediction graph, derived by modulating softmax prediction similarities with intra-class indicators. This alignment enforces that feature-level relationships reflect class-consistent prediction behavior, acting as a semantic regularizer throughout the network. Unlike prior work, GCR introduces a multi-layer, cross-space graph alignment mechanism with adaptive weighting, where layer importance is learned from graph discrepancy magnitudes. This allows the model to prioritize semantically reliable layers and suppress noisy ones, enhancing feature quality without modifying the architecture or training procedure. GCR is model-agnostic, lightweight, and improves semantic structure across various networks and datasets. Experiments show that GCR promotes cleaner feature structure, stronger intra-class cohesion, and improved generalization, offering a new perspective on learning from prediction structure. [Project website](https://darcyddx.github.io/gcr/) [Code](https://github.com/Darcyddx/graph-prompt)
中文摘要:图一致性正则化(GCR)是一种创新框架,通过将特征相似性图与类别感知预测图在多网络层中对齐,无需改变架构即可提升语义结构和泛化能力。
English Summary: Graph Consistency Regularization (GCR) is a novel framework that enhances feature learning by aligning feature similarity graphs with class-aware prediction graphs across network layers, improving semantic structure and generalization without architectural changes.

Authors:Zhaohua Zhang, Jianhuan Zhuo, Muxi Chen, Chenchen Zhao, Wenyu Jiang, Tianwen Jiang, Mingyang Chen, Yu Tang, Qiuyong Xiao, Jihong Zhang, Zhixun Su
Title: GRAPE: Let GPRO Supervise Query Rewriting by Ranking for Retrieval
Abstract:
The CLIP model has become a cornerstone of large-scale retrieval systems by aligning text and image data in a unified embedding space. Despite its simplicity and efficiency, CLIP struggles when applied to tasks whose input distributions diverge from its training corpus, such as queries with multilingual, long-form, or multimodal differences. To avoid costly retraining, existing methods mainly adopt query-rewriting strategies with large language models (LLMs), aiming to mitigate distribution gaps at the query level. However, due to the lack of supervision signals, LLMs fail to generate the optimal one that fits the training distribution. We address this challenge with GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play enhancement approach that incorporates ranking signals into retrieval-guided query rewriting with LLMs. Intuitively, GRAPE proposes to leverage GRPO to bridge distributional differences -- including length, multilingual, and modality shifts -- by transforming queries into forms better aligned with the retriever's training distribution. However, our preliminary experiment finds that naively finetuning LLM with similarity scores can lead to score inflation, where nearly all candidates are assigned unexpectedly high scores regardless of their true relevance. To address score inflation, we propose a corpus-relative ranking-based reward, which explicitly aligns optimization with ranking metrics while suppressing spurious score inflation. Extensive experiments demonstrate that GRAPE consistently improves retrieval performance under distributional shifts -- including multilingual differences (Flickr30k-CN, CVLUE, XM3600), length differences (Wikipedia), and multimodal differences (CIRR) -- achieving an average improvement of 4.9\% in Recall\@10. The code is available at https://github.com/Chinese0123456/GRAPE.git
Chinese Summary: GRAPE提出了一种即插即用的增强方法,通过引入排序感知优化来提升CLIP模型在分布差异场景下的检索性能,在多语言、长文本和多模态任务中均实现了显著改进而无需重新训练。
English Summary: GRAPE introduces a plug-and-play enhancement that uses ranking-aware optimization to improve CLIP-based retrieval performance under distribution shifts, achieving significant gains across multilingual, length, and modality variations without retraining.

Authors:Wei Zhou, Guoliang Li, Haoyu Wang, Yuxing Han, Xufei Wu, Fan Wu, Xuanhe Zhou
Title: PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation
Abstract:
Large language models (LLMS) have shown increasing effectiveness in Text-to-SQL tasks. However, another closely related problem, Cross-System SQL Translation (a.k.a., SQL-to-SQL), which adapts a query written for one database system (e.g., MySQL) into its equivalent one for another system (e.g., ClickHouse), is of great practical importance but remains underexplored. Existing SQL benchmarks are not well-suited for SQL-to-SQL evaluation, which (1) focus on a limited set of database systems (often just SQLite) and (2) cannot capture many system-specific SQL dialects (e.g., customized functions, data types, and syntax rules). Thus, in this paper, we introduce PARROT, a Practical And Realistic BenchmaRk for CrOss-System SQL Translation. PARROT comprises 598 translation pairs from 38 open-source benchmarks and real-world business services, specifically prepared to challenge system-specific SQL understanding (e.g., LLMS achieve lower than 38.53% accuracy on average). We also provide multiple benchmark variants, including PARROT-Diverse with 28,003 translations (for extensive syntax testing) and PARROT-Simple with 5,306 representative samples (for focused stress testing), covering 22 production-grade database systems. To promote future research, we release a public leaderboard and source code at: https://code4db.github.io/parrot-bench/.
中文: 大语言模型在文本转SQL任务中日益有效,但跨系统SQL翻译这一实际问题仍待探索,因此我们推出PARROT基准测试,包含多样化翻译对以评估系统特定的SQL理解能力。
English: Large language models are increasingly effective for Text-to-SQL tasks, but the practical problem of cross-system SQL translation remains underexplored, prompting the introduction of PARROT, a comprehensive benchmark with diverse translation pairs to evaluate system-specific SQL understanding.

Authors:Andrej Orsula, Matthieu Geist, Miguel Olivares-Mendez, Carol Martinez
Title: Space Robotics Bench: Robot Learning Beyond Earth
Abstract:
The growing ambition for space exploration demands robust autonomous systems that can operate in unstructured environments under extreme extraterrestrial conditions. The adoption of robot learning in this domain is severely hindered by the prohibitive cost of technology demonstrations and the limited availability of data. To bridge this gap, we introduce the Space Robotics Bench, an open-source simulation framework for robot learning in space. It offers a modular architecture that integrates on-demand procedural generation with massively parallel simulation environments to support the creation of vast and diverse training distributions for learning-based agents. To ground research and enable direct comparison, the framework includes a comprehensive suite of benchmark tasks that span a wide range of mission-relevant scenarios. We establish performance baselines using standard reinforcement learning algorithms and present a series of experimental case studies that investigate key challenges in generalization, end-to-end learning, adaptive control, and sim-to-real transfer. Our results reveal insights into the limitations of current methods and demonstrate the utility of the framework in producing policies capable of real-world operation. These contributions establish the Space Robotics Bench as a valuable resource for developing, benchmarking, and deploying the robust autonomous systems required for the final frontier.
中文摘要:Space Robotics Bench 是一个开源仿真框架,旨在通过支持大规模多样化训练与任务基准测试,解决太空机器人技术开发中成本高昂和数据稀缺的问题,并已展现出实际应用潜力。
English Summary: The Space Robotics Bench is an open-source simulation framework designed to overcome the high costs and data scarcity in space robotics by enabling large-scale, diverse training and benchmarking for autonomous systems, with demonstrated real-world applicability.

Authors:Siheng Wang, Zhengdao Li, Yanshu Li, Canran Xiao, Haibo Zhan, Zhengtao Yao, Xuzhi Zhang, Jiale Kang, Linshan Li, Weiming Liu, Zhikang Dong, Jifeng Shen, Junhao Dong, Qiang Sun, Piotr Koniusz
Title: C3-OWD: A Curriculum Cross-modal Contrastive Learning Framework for Open-World Detection
Abstract:
Object detection has advanced significantly in the closed-set setting, but real-world deployment remains limited by two challenges: poor generalization to unseen categories and insufficient robustness under adverse conditions. Prior research has explored these issues separately: visible-infrared detection improves robustness but lacks generalization, while open-world detection leverages vision-language alignment strategy for category diversity but struggles under extreme environments. This trade-off leaves robustness and diversity difficult to achieve simultaneously. To mitigate these issues, we propose \textbf{C3-OWD}, a curriculum cross-modal contrastive learning framework that unifies both strengths. Stage~1 enhances robustness by pretraining with RGBT data, while Stage~2 improves generalization via vision-language alignment. To prevent catastrophic forgetting between two stages, we introduce an Exponential Moving Average (EMA) mechanism that theoretically guarantees preservation of pre-stage performance with bounded parameter lag and function consistency. Experiments on FLIR, OV-COCO, and OV-LVIS demonstrate the effectiveness of our approach: C3-OWD achieves $80.1$ AP$^{50}$ on FLIR, $48.6$ AP$^{50}_{\text{Novel}}$ on OV-COCO, and $35.7$ mAP$_r$ on OV-LVIS, establishing competitive performance across both robustness and diversity evaluations. Code available at: https://github.com/justin-herry/C3-OWD.git.
中文:提出的C3-OWD框架通过课程式跨模态对比学习,将目标检测的鲁棒性与泛化能力相统一,在防止灾难性遗忘的同时,在多个基准测试中实现了优越性能。
English: The proposed C3-OWD framework unifies robustness and generalization in object detection through curriculum cross-modal contrastive learning, achieving competitive performance across diverse benchmarks while preventing catastrophic forgetting via an EMA mechanism.

Authors:Hao Liu, Yongjie Zheng, Yuhan Kang, Mingyang Zhang, Maoguo Gong, Lorenzo Bruzzone
Title: Balanced Diffusion-Guided Fusion for Multimodal Remote Sensing Classification
Abstract:
Deep learning-based techniques for the analysis of multimodal remote sensing data have become popular due to their ability to effectively integrate complementary spatial, spectral, and structural information from different sensors. Recently, denoising diffusion probabilistic models (DDPMs) have attracted attention in the remote sensing community due to their powerful ability to capture robust and complex spatial-spectral distributions. However, pre-training multimodal DDPMs may result in modality imbalance, and effectively leveraging diffusion features to guide complementary diversity feature extraction remains an open question. To address these issues, this paper proposes a balanced diffusion-guided fusion (BDGF) framework that leverages multimodal diffusion features to guide a multi-branch network for land-cover classification. Specifically, we propose an adaptive modality masking strategy to encourage the DDPMs to obtain a modality-balanced rather than spectral image-dominated data distribution. Subsequently, these diffusion features hierarchically guide feature extraction among CNN, Mamba, and transformer networks by integrating feature fusion, group channel attention, and cross-attention mechanisms. Finally, a mutual learning strategy is developed to enhance inter-branch collaboration by aligning the probability entropy and feature similarity of individual subnetworks. Extensive experiments on four multimodal remote sensing datasets demonstrate that the proposed method achieves superior classification performance. The code is available at https://github.com/HaoLiu-XDU/BDGF.
中文: 本文提出了一种平衡扩散引导融合(BDGF)框架,通过自适应模态掩码策略和扩散特征引导多分支网络,解决了多模态遥感数据中的模态不平衡问题,并在土地覆盖分类中实现了优越性能。
English: This paper introduces a balanced diffusion-guided fusion (BDGF) framework that addresses modality imbalance in multimodal remote sensing data by using diffusion features to guide multi-branch networks, achieving superior land-cover classification through adaptive masking and mutual learning strategies.

Authors:Yike Zhu, Boyi Kang, Ziqian Wang, Xingchen Li, Zihan Zhang, Wenjie Li, Longshuai Xiao, Wei Xue, Lei Xie
Title: MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow
Abstract:
Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion/flow-matching) or large language models, limiting real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. In the Interspeech 2020 DNS Challenge blind test set and simulated test set, MeanFlowSE attains state-of-the-art (SOTA) level perceptual quality and competitive intelligibility while significantly lowering both real-time factor (RTF) and model size compared with recent generative competitors, making it suitable for practical use. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.
中文:MeanFlowSE是一种单步生成式语音增强框架,在显著降低计算成本和模型大小的同时,实现了顶尖的感知质量和竞争力强的可懂度,适用于实时应用场景。
English: MeanFlowSE is a one-step generative speech enhancement framework that achieves state-of-the-art perceptual quality and competitive intelligibility with significantly reduced computational cost and model size, making it practical for real-time applications.

Authors:Minsun Jeon, Simon S. Woo
Title: Seeing Through the Blur: Unlocking Defocus Maps for Deepfake Detection
Abstract:
The rapid advancement of generative AI has enabled the mass production of photorealistic synthetic images, blurring the boundary between authentic and fabricated visual content. This challenge is particularly evident in deepfake scenarios involving facial manipulation, but also extends to broader AI-generated content (AIGC) cases involving fully synthesized scenes. As such content becomes increasingly difficult to distinguish from reality, the integrity of visual media is under threat. To address this issue, we propose a physically interpretable deepfake detection framework and demonstrate that defocus blur can serve as an effective forensic signal. Defocus blur is a depth-dependent optical phenomenon that naturally occurs in camera-captured images due to lens focus and scene geometry. In contrast, synthetic images often lack realistic depth-of-field (DoF) characteristics. To capture these discrepancies, we construct a defocus blur map and use it as a discriminative feature for detecting manipulated content. Unlike RGB textures or frequency-domain signals, defocus blur arises universally from optical imaging principles and encodes physical scene structure. This makes it a robust and generalizable forensic cue. Our approach is supported by three in-depth feature analyses, and experimental results confirm that defocus blur provides a reliable and interpretable cue for identifying synthetic images. We aim for our defocus-based detection pipeline and interpretability tools to contribute meaningfully to ongoing research in media forensics. The implementation is publicly available at: https://github.com/irissun9602/Defocus-Deepfake-Detection
中文: 该框架利用散焦模糊(相机拍摄图像中的自然光学现象)作为可物理解读的取证信号来检测AI生成的合成内容,并通过特征分析和实验验证证明了其鲁棒性。
English: The proposed framework leverages defocus blur—a natural optical phenomenon in camera-captured images—as a physically interpretable forensic signal to detect AI-generated synthetic content, demonstrating robustness through feature analyses and experimental validation.

Authors:Sasan Sharifipour, Constantino Álvarez Casado, Le Nguyen, Tharindu Ekanayake, Manuel Lage Cañellas, Nhi Nguyen, Miguel Bordallo López
Title: LiDAR-based Human Activity Recognition through Laplacian Spectral Analysis
Abstract:
Human Activity Recognition supports applications in healthcare, manufacturing, and human-machine interaction. LiDAR point clouds offer a privacy-preserving alternative to cameras and are robust to illumination. We propose a HAR method based on graph spectral analysis. Each LiDAR frame is mapped to a proximity graph (epsilon-graph) and the Laplacian spectrum is computed. Eigenvalues and statistics of eigenvectors form pose descriptors, and temporal statistics over sliding windows yield fixed vectors for classification with support vector machines and random forests. On the MM-Fi dataset with 40 subjects and 27 activities, under a strict subject-independent protocol, the method reaches 94.4% accuracy on a 13-class rehabilitation set and 90.3% on all 27 activities. It also surpasses the skeleton-based baselines reported for MM-Fi. The contribution is a compact and interpretable feature set derived directly from point cloud geometry that provides an accurate and efficient alternative to end-to-end deep learning.
中文: 本研究提出一种基于LiDAR点云的人类活动识别方法,通过构建邻近图并分析其拉普拉斯谱来生成姿态描述符,在MM-Fi数据集上准确率超过90%,同时提供了保护隐私且可解释的深度学习替代方案。
English: This study introduces a human activity recognition method using LiDAR point clouds, which constructs proximity graphs and analyzes their Laplacian spectra to create pose descriptors, achieving over 90% accuracy on the MM-Fi dataset while offering a privacy-preserving and interpretable alternative to deep learning approaches.

Authors:Shamir Matan, Elhadad Osher, Nageris Ben, Mirsky Reuth
Title: Online Dynamic Goal Recognition in Gym Environments
Abstract:
Goal Recognition (GR) is the task of inferring an agent's intended goal from partial observations of its behavior, typically in an online and one-shot setting. Despite recent advances in model-free GR, particularly in applications such as human-robot interaction, surveillance, and assistive systems, the field remains fragmented due to inconsistencies in benchmarks, domains, and evaluation protocols. To address this, we introduce gr-libs (https://github.com/MatanShamir1/gr_libs) and gr-envs (https://github.com/MatanShamir1/gr_envs), two complementary open-source frameworks that support the development, evaluation, and comparison of GR algorithms in Gym-compatible environments. gr-libs includes modular implementations of MDP-based GR baselines, diagnostic tools, and evaluation utilities. gr-envs provides a curated suite of environments adapted for dynamic and goal-directed behavior, along with wrappers that ensure compatibility with standard reinforcement learning toolkits. Together, these libraries offer a standardized, extensible, and reproducible platform for advancing GR research. Both packages are open-source and available on GitHub and PyPI.
中文: 作者推出了两个开源框架gr-libs和gr-envs,通过提供模块化工具和兼容环境来标准化目标识别研究,支持算法的开发与评估。
English: The authors introduce two open-source frameworks, gr-libs and gr-envs, to standardize Goal Recognition research by providing modular tools and compatible environments for developing and evaluating algorithms.

Authors:Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, Jinsong Su
Title: SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts
Abstract:
Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods-such as parallelization, objective- and data-driven modifications, and replay buffers-either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including GSM8K, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL
中文: SPEC-RL框架通过将推测式解码与强化学习过程结合,重用先前训练轮次中的重叠轨迹片段,在数学推理和泛化基准测试中实现2-3倍的训练加速,且不降低策略质量。
English: SPEC-RL is a novel framework that accelerates reinforcement learning with verifiable rewards by reusing overlapping trajectory segments from prior epochs through speculative decoding, reducing rollout time by 2-3x without sacrificing policy quality across various reasoning benchmarks.

Authors:Wenhao Zhang, Shao Zhang, Xihuai Wang, Yang Li, Ying Wen
Title: Towards Monotonic Improvement in In-Context Reinforcement Learning
Abstract:
In-Context Reinforcement Learning (ICRL) has emerged as a promising paradigm for developing agents that can rapidly adapt to new tasks by leveraging past experiences as context, without updating their parameters. Recent approaches train large sequence models on monotonic policy improvement data from online RL, aiming to a continue improved testing time performance. However, our experimental analysis reveals a critical flaw: these models cannot show a continue improvement like the training data during testing time. Theoretically, we identify this phenomenon as Contextual Ambiguity, where the model's own stochastic actions can generate an interaction history that misleadingly resembles that of a sub-optimal policy from the training data, initiating a vicious cycle of poor action selection. To resolve the Contextual Ambiguity, we introduce Context Value into training phase and propose Context Value Informed ICRL (CV-ICRL). CV-ICRL use Context Value as an explicit signal representing the ideal performance theoretically achievable by a policy given the current context. As the context expands, Context Value could include more task-relevant information, and therefore the ideal performance should be non-decreasing. We prove that the Context Value tightens the lower bound on the performance gap relative to an ideal, monotonically improving policy. We fruther propose two methods for estimating Context Value at both training and testing time. Experiments conducted on the Dark Room and Minigrid testbeds demonstrate that CV-ICRL effectively mitigates performance degradation and improves overall ICRL abilities across various tasks and environments. The source code and data of this paper are available at https://github.com/Bluixe/towards_monotonic_improvement .
中文摘要:情境强化学习存在情境模糊性问题,导致模型在测试时无法持续改进,而提出的CV-ICRL方法通过引入情境值来收紧性能界限,在多个测试环境中有效提升了模型表现。
English Summary: In-Context Reinforcement Learning suffers from Contextual Ambiguity where models fail to maintain continuous improvement during testing, which the proposed CV-ICRL method resolves by incorporating Context Value to tighten performance bounds and demonstrate effectiveness across multiple environments.

Authors:Haorui Yu, Ramon Ruiz-Dolz, Qiufeng Yi
Title: A Structured Framework for Evaluating and Enhancing Interpretive Capabilities of Multimodal LLMs in Culturally Situated Tasks
Abstract:
This study aims to test and evaluate the capabilities and characteristics of current mainstream Visual Language Models (VLMs) in generating critiques for traditional Chinese painting. To achieve this, we first developed a quantitative framework for Chinese painting critique. This framework was constructed by extracting multi-dimensional evaluative features covering evaluative stance, feature focus, and commentary quality from human expert critiques using a zero-shot classification model. Based on these features, several representative critic personas were defined and quantified. This framework was then employed to evaluate selected VLMs such as Llama, Qwen, or Gemini. The experimental design involved persona-guided prompting to assess the VLM's ability to generate critiques from diverse perspectives. Our findings reveal the current performance levels, strengths, and areas for improvement of VLMs in the domain of art critique, offering insights into their potential and limitations in complex semantic understanding and content generation tasks. The code used for our experiments can be publicly accessed at: https://github.com/yha9806/VULCA-EMNLP2025.
中文摘要:本研究通过构建基于专家分析的量化评估框架和角色引导测试,评估了主流视觉语言模型在中国传统绘画评论中的表现,揭示了其在艺术批评领域的能力现状与改进空间。
English Summary: This study evaluates mainstream Visual Language Models' ability to critique traditional Chinese paintings by developing a quantitative framework based on expert analysis and persona-guided testing, revealing their current capabilities and limitations in art criticism.

Authors:Xiaowen Ma, Shuning Ge, Fan Yang, Xiangyu Li, Yun Chen, Mengting Ma, Wei Zhang, Zhipeng Liu
Title: TimeExpert: Boosting Long Time Series Forecasting with Temporal Mix of Experts
Abstract:
Transformer-based architectures dominate time series modeling by enabling global attention over all timestamps, yet their rigid 'one-size-fits-all' context aggregation fails to address two critical challenges in real-world data: (1) inherent lag effects, where the relevance of historical timestamps to a query varies dynamically; (2) anomalous segments, which introduce noisy signals that degrade forecasting accuracy. To resolve these problems, we propose the Temporal Mix of Experts (TMOE), a novel attention-level mechanism that reimagines key-value (K-V) pairs as local experts (each specialized in a distinct temporal context) and performs adaptive expert selection for each query via localized filtering of irrelevant timestamps. Complementing this local adaptation, a shared global expert preserves the Transformer's strength in capturing long-range dependencies. We then replace the vanilla attention mechanism in popular time-series Transformer frameworks (i.e., PatchTST and Timer) with TMOE, without extra structural modifications, yielding our specific version TimeExpert and general version TimeExpert-G. Extensive experiments on seven real-world long-term forecasting benchmarks demonstrate that TimeExpert and TimeExpert-G outperform state-of-the-art methods. Code is available at https://github.com/xwmaxwma/TimeExpert.
中文摘要:提出的时序专家混合机制通过动态选择相关时间专家并过滤噪声,解决了Transformer模型在时间序列预测中的固有问题,同时保留全局依赖捕捉能力,在多个预测基准上实现了最优性能。
English Summary: The proposed Temporal Mix of Experts (TMOE) mechanism addresses limitations in Transformer-based time series models by dynamically selecting relevant temporal experts and filtering noise, while maintaining global dependency capture, achieving state-of-the-art performance on forecasting benchmarks.

Authors:Donghao Zhang, Yimin Chen, Kauê TN Duarte, Taha Aslan, Mohamed AlShamrani, Brij Karmur, Yan Wan, Shengcai Chen, Bo Hu, Bijoy K Menon, Wu Qiu
Title: Benchmarking DINOv3 for Multi-Task Stroke Analysis on Non-Contrast CT
Abstract:
Non-contrast computed tomography (NCCT) is essential for rapid stroke diagnosis but is limited by low image contrast and signal to noise ratio. We address this challenge by leveraging DINOv3, a state-of-the-art self-supervised vision transformer, to generate powerful feature representations for a comprehensive set of stroke analysis tasks. Our evaluation encompasses infarct and hemorrhage segmentation, anomaly classification (normal vs. stroke and normal vs. infarct vs. hemorrhage), hemorrhage subtype classification (EDH, SDH, SAH, IPH, IVH), and dichotomized ASPECTS classification (<=6 vs. >6) on multiple public and private datasets. This study establishes strong benchmarks for these tasks and demonstrates the potential of advanced self-supervised models to improve automated stroke diagnosis from NCCT, providing a clear analysis of both the advantages and current constraints of the approach. The code is available at https://github.com/Zzz0251/DINOv3-stroke.
中文: 本研究利用DINOv3自监督视觉变换器改进非增强CT的卒中分析任务,建立了强大基准并展示了其在自动化诊断中的潜力,同时指出了当前方法的局限性。
English: This study utilizes the DINOv3 self-supervised vision transformer to enhance stroke analysis tasks on non-contrast CT, establishing strong benchmarks and demonstrating its potential for automated diagnosis while acknowledging current limitations.

Authors:Haotian Liu, Shuo Wang, Hongteng Xu
Title: C$^2$GSPG: Confidence-calibrated Group Sequence Policy Gradient towards Self-aware Reasoning
Abstract:
Reinforcement Learning (RL) methods, exemplified by Group Relative Policy Optimization (GRPO) and its variants, play a central role in developing reasoning models. However, these methods often suffer from a critical overconfidence issue, which prevents them from achieving self-aware reasoning models. In this study, we propose a simple yet effective confidence-calibration group sequence policy gradient method, called C$^2$GSPG, which simultaneously enhances reasoning performance while suppressing overconfidence. In principle, we propose a Group Sequence Policy Gradient (GSPG) framework for learning reasoning models, which eliminates the token-level bias commonly appearing in GRPO and its variants. In this framework, we define the model confidence for each reasoning problem using the normalized sequence-level probability, and then apply a cross-entropy regularizer to calibrate the model confidence to the sequence's reward. We demonstrate that the confidence calibration regularizer and GSPG are collaborative for binary rewards, as their objectives always share the same gradient direction. For non-binary rewards, we apply nonlinear reward normalization and adaptive regularizer clipping, mitigating the potential conflict between the two objectives. Applying C$^2$GSPG to post-train large language models in logical and mathematical reasoning tasks, we show its superiority over state-of-the-art methods in both reasoning accuracy and confidence calibration. The code of C$^2$GSPG is available at https://github.com/HaotianLiu123/CCGSPG.
Chinese: 本研究提出C²GSPG方法,通过置信度校准的组序列策略梯度,在增强推理性能的同时抑制强化学习模型的过度自信问题,在逻辑和数学推理任务中展现出优于现有方法的准确性与校准能力。
English: This study introduces C²GSPG, a confidence-calibration group sequence policy gradient method that enhances reasoning performance and mitigates overconfidence in reinforcement learning models, demonstrating superior accuracy and calibration in logical and mathematical tasks.

Authors:Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang
Title: RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Abstract:
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors. To address this, we introduce RHYTHM (Reasoning with Hierarchical Temporal Tokenization for Human Mobility), a unified framework that leverages large language models (LLMs) as general-purpose spatio-temporal predictors and trajectory reasoners. Methodologically, RHYTHM employs temporal tokenization to partition each trajectory into daily segments and encode them as discrete tokens with hierarchical attention that captures both daily and weekly dependencies, thereby significantly reducing the sequence length while preserving cyclical information. Additionally, we enrich token representations by adding pre-computed prompt embeddings for trajectory segments and prediction targets via a frozen LLM, and feeding these combined embeddings back into the LLM backbone to capture complex interdependencies. Computationally, RHYTHM freezes the pretrained LLM's backbone to reduce attention complexity and memory cost. We evaluate our model against state-of-the-art methods using three real-world datasets. Notably, RHYTHM achieves a 2.4% improvement in overall accuracy, a 5.0% increase on weekends, and a 24.6% reduction in training time. Code is publicly available at https://github.com/he-h/rhythm.
中文:RHYTHM是一种创新框架,利用大型语言模型通过分层注意力对轨迹进行标记化来预测人类移动,实现了更高的准确性和更快的训练速度。
English: RHYTHM is a novel framework that uses large language models to predict human mobility by tokenizing trajectories with hierarchical attention, achieving higher accuracy and faster training times.

Authors:Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
Title: Multiplayer Nash Preference Optimization
Abstract:
Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
Chinese: 本文提出了多人纳什偏好优化(MNPO)框架,将纳什学习从双人博弈扩展到多人场景,通过更丰富的竞争动态更好地对齐复杂非传递性的人类偏好,在异构标注条件下持续超越现有基线模型。
English: This paper introduces Multiplayer Nash Preference Optimization (MNPO), a novel framework that extends Nash learning from human feedback to multiplayer settings, enabling richer competitive dynamics and improved alignment with complex, non-transitive human preferences while consistently outperforming existing baselines.

Authors:Wen Tao, Jing Tang, Alvin Chan, Bryan Hooi, Baolong Bi, Nanyun Peng, Yuansheng Liu, Yiwei Wang
Title: How to Make Large Language Models Generate 100% Valid Molecules?
Abstract:
Molecule generation is key to drug discovery and materials science, enabling the design of novel compounds with specific properties. Large language models (LLMs) can learn to perform a wide range of tasks from just a few examples. However, generating valid molecules using representations like SMILES is challenging for LLMs in few-shot settings. In this work, we explore how LLMs can generate 100% valid molecules. We evaluate whether LLMs can use SELFIES, a representation where every string corresponds to a valid molecule, for valid molecule generation but find that LLMs perform worse with SELFIES than with SMILES. We then examine LLMs' ability to correct invalid SMILES and find their capacity limited. Finally, we introduce SmiSelf, a cross-chemical language framework for invalid SMILES correction. SmiSelf converts invalid SMILES to SELFIES using grammatical rules, leveraging SELFIES' mechanisms to correct the invalid SMILES. Experiments show that SmiSelf ensures 100% validity while preserving molecular characteristics and maintaining or even enhancing performance on other metrics. SmiSelf helps expand LLMs' practical applications in biomedicine and is compatible with all SMILES-based generative models. Code is available at https://github.com/wentao228/SmiSelf.
中文摘要:本研究提出SmiSelf框架,通过语法规则将无效SMILES转换为SELFIES表示,实现了100%有效分子生成,同时保持分子特性并提升其他性能指标,拓展了大语言模型在生物医学领域的实际应用。
English Summary: This study introduces SmiSelf, a cross-chemical framework that converts invalid SMILES to SELFIES using grammatical rules to achieve 100% valid molecule generation while preserving molecular characteristics and enhancing performance metrics.

Authors:Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang
Title: d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
Abstract:
Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce \textit{Dual aDaptive Cache} (d$^2$Cache), which is a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (\ie, LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
中文: 提出的d²Cache框架通过双阶段自适应KV缓存机制,无需重新训练即可显著提升扩散大语言模型的推理速度与生成质量。
English: The proposed d²Cache framework enhances diffusion-based large language models by implementing a training-free dual adaptive KV cache that significantly boosts inference speed and generation quality without requiring model retraining.

Authors:Yutao Shen, Junkun Yuan, Toru Aonishi, Hideki Nakayama, Yue Ma
Title: Follow-Your-Preference: Towards Preference-Aligned Image Inpainting
Abstract:
This paper investigates image inpainting with preference alignment. Instead of introducing a novel method, we go back to basics and revisit fundamental problems in achieving such alignment. We leverage the prominent direct preference optimization approach for alignment training and employ public reward models to construct preference training datasets. Experiments are conducted across nine reward models, two benchmarks, and two baseline models with varying structures and generative algorithms. Our key findings are as follows: (1) Most reward models deliver valid reward scores for constructing preference data, even if some of them are not reliable evaluators. (2) Preference data demonstrates robust trends in both candidate scaling and sample scaling across models and benchmarks. (3) Observable biases in reward models, particularly in brightness, composition, and color scheme, render them susceptible to cause reward hacking. (4) A simple ensemble of these models yields robust and generalizable results by mitigating such biases. Built upon these observations, our alignment models significantly outperform prior models across standard metrics, GPT-4 assessments, and human evaluations, without any changes to model structures or the use of new datasets. We hope our work can set a simple yet solid baseline, pushing this promising frontier. Our code is open-sourced at: https://github.com/shenytzzz/Follow-Your-Preference.
本研究通过应用直接偏好优化与公共奖励模型重新审视图像修复中的偏好对齐,揭示了奖励模型的偏差并证明简单的集成方法能有效缓解这些问题,从而在多项评估中实现更优性能。
This study revisits image inpainting preference alignment by applying direct preference optimization with public reward models, revealing their biases and demonstrating that a simple ensemble approach effectively mitigates these issues to achieve superior performance across multiple evaluations.

Authors:Ben Liang, Yuan Liu, Bingwen Qiu, Yihong Wang, Xiubao Sui, Qian Chen
Title: FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection
Abstract:
Aerial-view object detection is a critical technology for real-world applications such as natural resource monitoring, traffic management, and UAV-based search and rescue. Detecting tiny objects in high-resolution aerial imagery presents a long-standing challenge due to their limited visual cues and the difficulty of modeling global context in complex scenes. Existing methods are often hampered by delayed contextual fusion and inadequate non-linear modeling, failing to effectively use global information to refine shallow features and thus encountering a performance bottleneck. To address these challenges, we propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception in shallow features while preserving fine-grained details, and employs Kolmogorov-Arnold networks to achieve adaptive non-linear modeling of multi-scale dependencies. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to to balance detail preservation and global enhancement. Extensive experiments on benchmark aerial-view datasets demonstrate that FMC-DETR achieves state-of-the-art performance with fewer parameters. On the challenging VisDrone dataset, our model achieves improvements of 6.5% AP and 8.2% AP50 over the baseline, highlighting its effectiveness in tiny object detection. The code can be accessed at https://github.com/bloomingvision/FMC-DETR.
Chinese: FMC-DETR提出了一种具有频率解耦融合的新型框架和Wavelet Kolmogorov-Arnold Transformer骨干网络,以增强全局上下文感知和自适应非线性建模,在航空视角微小物体检测中以更少的参数实现了最先进的性能。
English: FMC-DETR introduces a novel framework with frequency-decoupled fusion and a Wavelet Kolmogorov-Arnold Transformer backbone to enhance global context perception and adaptive non-linear modeling, achieving state-of-the-art performance in aerial-view tiny object detection with fewer parameters.

Authors:Zijian Wang, Xiaofei Zhang, Xin Zhang, Yukun Liu, Qiong Zhang
Title: Beyond Aggregation: Guiding Clients in Heterogeneous Federated Learning
Abstract:
Federated learning (FL) is increasingly adopted in domains like healthcare, where data privacy is paramount. A fundamental challenge in these systems is statistical heterogeneity-the fact that data distributions vary significantly across clients (e.g., different hospitals may treat distinct patient demographics). While current FL algorithms focus on aggregating model updates from these heterogeneous clients, the potential of the central server remains under-explored. This paper is motivated by a healthcare scenario: could a central server not only build a model but also guide a new patient to the hospital best equipped for their specific condition? We generalize this idea to propose a novel paradigm for FL systems where the server actively guides the allocation of new tasks or queries to the most appropriate client in the network. To enable this, we introduce an empirical likelihood-based framework that simultaneously addresses two goals: (1) learning effective local models on each client, and (2) finding the best matching client for a new query. Empirical results demonstrate the framework's effectiveness on benchmark datasets, showing improvements in both model accuracy and the precision of client guidance compared to standard FL approaches. This work opens a new direction for building more intelligent and resource-efficient federated systems that leverage heterogeneity as a feature, not just a bug. Code is available at https://github.com/zijianwang0510/FedDRM.git.
中文摘要:本文提出了一种新颖的联邦学习范式,其中中央服务器不仅聚合模型,还通过经验似然框架智能地将新查询引导至最合适的客户端,从而在模型精度和客户端匹配准确性上实现双重提升。
English Summary: This paper introduces a novel federated learning paradigm where the central server not only aggregates models but also intelligently directs new queries to the most suitable client, using an empirical likelihood framework to improve both model accuracy and client matching precision.

Authors:Ye-eun Kim, Suhyeon Lim, Andrew J. Choi
Title: MMeViT: Multi-Modal ensemble ViT for Post-Stroke Rehabilitation Action Recognition
Abstract:
Rehabilitation therapy for stroke patients faces a supply shortage despite the increasing demand. To address this issue, remote monitoring systems that reduce the burden on medical staff are emerging as a viable alternative. A key component of these remote monitoring systems is Human Action Recognition (HAR) technology, which classifies actions. However, existing HAR studies have primarily focused on non-disable individuals, making them unsuitable for recognizing the actions of stroke patients. HAR research for stroke has largely concentrated on classifying relatively simple actions using machine learning rather than deep learning. In this study, we designed a system to monitor the actions of stroke patients, focusing on domiciliary upper limb Activities of Daily Living (ADL). Our system utilizes IMU (Inertial Measurement Unit) sensors and an RGB-D camera, which are the most common modalities in HAR. We directly collected a dataset through this system, investigated an appropriate preprocess and proposed a deep learning model suitable for processing multimodal data. We analyzed the collected dataset and found that the action data of stroke patients is less clustering than that of non-disabled individuals. Simultaneously, we found that the proposed model learns similar tendencies for each label in data with features that are difficult to clustering. This study suggests the possibility of expanding the deep learning model, which has learned the action features of stroke patients, to not only simple action recognition but also feedback such as assessment contributing to domiciliary rehabilitation in future research. The code presented in this study is available at https://github.com/ye-Kim/MMeViT.
中文: 本研究开发了一种基于惯性测量单元传感器和RGB-D摄像头的多模态深度学习系统,用于识别脑卒中患者的上肢日常活动,解决了现有动作识别模型不适用于该人群的局限性,为远程康复监测提供了潜在应用前景。
English: This study develops a multimodal deep learning system using IMU sensors and an RGB-D camera to recognize upper limb daily activities in stroke patients, addressing the limitations of existing human action recognition models that are unsuitable for this population and enabling potential applications in remote rehabilitation monitoring.

Authors:Zi Liang, Qingqing Ye, Xuan Liu, Yanyun Wang, Jianliang Xu, Haibo Hu
Title: Virus Infection Attack on LLMs: Your Poisoning Can Spread "VIA" Synthetic Data
Abstract:
Synthetic data refers to artificial samples generated by models. While it has been validated to significantly enhance the performance of large language models (LLMs) during training and has been widely adopted in LLM development, potential security risks it may introduce remain uninvestigated. This paper systematically evaluates the resilience of synthetic-data-integrated training paradigm for LLMs against mainstream poisoning and backdoor attacks. We reveal that such a paradigm exhibits strong resistance to existing attacks, primarily thanks to the different distribution patterns between poisoning data and queries used to generate synthetic samples. To enhance the effectiveness of these attacks and further investigate the security risks introduced by synthetic data, we introduce a novel and universal attack framework, namely, Virus Infection Attack (VIA), which enables the propagation of current attacks through synthetic data even under purely clean queries. Inspired by the principles of virus design in cybersecurity, VIA conceals the poisoning payload within a protective "shell" and strategically searches for optimal hijacking points in benign samples to maximize the likelihood of generating malicious content. Extensive experiments on both data poisoning and backdoor attacks show that VIA significantly increases the presence of poisoning content in synthetic data and correspondingly raises the attack success rate (ASR) on downstream models to levels comparable to those observed in the poisoned upstream models.
中文摘要:合成数据虽能显著提升大语言模型性能,但其潜在安全风险尚未被探究;本研究发现该训练范式对主流攻击具有强抵抗力,并提出新型病毒感染攻击(VIA),能通过合成数据有效传播恶意内容,大幅提升攻击成功率。
English Summary: Synthetic data enhances LLM performance but poses unexamined security risks, with this study revealing its resilience to standard attacks due to distribution differences while introducing the Virus Infection Attack (VIA) that effectively propagates malicious content through synthetic data.

Authors:Gabriel A. Viana, Luis F. Alves Pereira, Tsang Ing Ren, George D. C. Cavalcanti, Jan Sijbers
Title: Perceptual Influence: Improving the Perceptual Loss Design for Low-Dose CT Enhancement
Abstract:
Perceptual losses have emerged as powerful tools for training networks to enhance Low-Dose Computed Tomography (LDCT) images, offering an alternative to traditional pixel-wise losses such as Mean Squared Error, which often lead to over-smoothed reconstructions and loss of clinically relevant details in LDCT images. The perceptual losses operate in a latent feature space defined by a pretrained encoder and aim to preserve semantic content by comparing high-level features rather than raw pixel values. However, the design of perceptual losses involves critical yet underexplored decisions, including the feature representation level, the dataset used to pretrain the encoder, and the relative importance assigned to the perceptual component during optimization. In this work, we introduce the concept of perceptual influence (a metric that quantifies the relative contribution of the perceptual loss term to the total loss) and propose a principled framework to assess the impact of the loss design choices on the model training performance. Through systematic experimentation, we show that the widely used configurations in the literature to set up a perceptual loss underperform compared to better-designed alternatives. Our findings show that better perceptual loss designs lead to significant improvements in noise reduction and structural fidelity of reconstructed CT images, without requiring any changes to the network architecture. We also provide objective guidelines, supported by statistical analysis, to inform the effective use of perceptual losses in LDCT denoising. Our source code is available at https://github.com/vngabriel/perceptual-influence.
感知损失通过特征空间比较保留语义内容以提升低剂量CT图像重建效果,本研究提出系统性评估框架,证明优化后的损失设计能在不改变网络架构的情况下显著提升噪声抑制与结构保真度。
Perceptual losses enhance LDCT image reconstruction by preserving semantic content through feature-level comparisons, and our study introduces a principled framework that demonstrates optimized loss designs significantly improve noise reduction and structural fidelity without altering network architecture.

Authors:Davi Bastos Costa, Renato Vicente
Title: Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
Abstract:
Mafia is a social deduction game where informed mafia compete against uninformed townsfolk. Its asymmetry of information and reliance on theory-of-mind reasoning mirror real-world multi-agent scenarios, making it a useful testbed for evaluating the social intelligence of large language models (LLMs). To support a systematic study, we introduce Mini-Mafia: a simplified four-player variant with one mafioso, one detective, and two villagers. We set the mafioso to kill a villager and the detective to investigate the mafioso during the night, reducing the game to a single day phase of discussion and voting. This setup isolates three interactive capabilities through role-specific win conditions: the mafioso must deceive, the villagers must detect deception, and the detective must effectively disclose information. To measure these skills, we have LLMs play against each other, creating the Mini-Mafia Benchmark: a two-stage framework that first estimates win rates within fixed opponent configurations, then aggregates performance across them using standardized scoring. Built entirely from model interactions without external data, the benchmark evolves as new models are introduced, with each one serving both as a new opponent and as a subject of evaluation. Our experiments reveal counterintuitive results, including cases where smaller models outperform larger ones. Beyond benchmarking, Mini-Mafia enables quantitative study of emergent multi-agent dynamics such as name bias and last-speaker advantage. It also contributes to AI safety by generating training data for deception detectors and by tracking models' deception capabilities against human baselines.
中文摘要:Mini-Mafia作为简化版社交推理游戏,通过设计特定角色获胜条件构建评估框架,既能检测语言模型在欺骗识别与信息传递方面的社交智能,又能为多智能体动态研究和AI安全提供量化分析基础。
English Summary: Mini-Mafia is a simplified social deduction game designed as a benchmark to evaluate large language models' social intelligence through deception detection and information disclosure, revealing unexpected performance patterns among different models.

Authors:Zhiqiang Tian, Weigang Li, Chunhua Deng, Junwei Hu, Yongqiang Wang, Wenping Liu
Title: Desensitizing for Improving Corruption Robustness in Point Cloud Classification through Adversarial Training
Abstract:
Due to scene complexity, sensor inaccuracies, and processing imprecision, point cloud corruption is inevitable. Over-reliance on input features is the root cause of DNN vulnerabilities. It remains unclear whether this issue exists in 3D tasks involving point clouds and whether reducing dependence on these features can enhance the model's robustness to corrupted point clouds. This study attempts to answer these questions. Specifically, we quantified the sensitivity of the DNN to point cloud features using Shapley values and found that models trained using traditional methods exhibited high sensitivity values for certain features. Furthermore, under an equal pruning ratio, prioritizing the pruning of highly sensitive features causes more severe damage to model performance than random pruning. We propose `Desensitized Adversarial Training' (DesenAT), generating adversarial samples using feature desensitization and conducting training within a self-distillation framework, which aims to alleviate DNN's over-reliance on point clouds features by smoothing sensitivity. First, data points with high contribution components are eliminated, and spatial transformation is used to simulate corruption scenes, generate adversarial samples, and conduct adversarial training on the model. Next, to compensate for information loss in adversarial samples, we use the self-distillation method to transfer knowledge from clean samples to adversarial samples, and perform adversarial training in a distillation manner.Extensive experiments on ModelNet-C and PointCloud-C demonstrate show that the propose method can effectively improve the robustness of the model without reducing the performance of clean data sets. This code is publicly available at \href{https://github.com/JerkyT/DesenAT/tree/master}{https://github.com/JerkyT/DesenAT}.
Chinese: 本研究提出去敏感对抗训练(DesenAT),通过生成对抗样本并采用自蒸馏方法,减轻深度神经网络对特定点云特征的过度依赖,从而在保持干净数据集性能的同时,有效提升模型对受损点云的鲁棒性。
English: This study introduces Desensitized Adversarial Training (DesenAT), a method that reduces deep neural networks' over-reliance on specific point cloud features by generating adversarial samples and using self-distillation, thereby enhancing model robustness against corrupted point clouds without compromising performance on clean datasets.

Authors:Lorenz K. Müller, Philippe Bich, Jiawei Zhuang, Ahmet Çelik, Luca Benfenati, Lukas Cavigelli
Title: SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
Abstract:
Post-training quantization has emerged as the most widely used strategy for deploying large language models at low precision. Still, current methods show perplexity degradation at bit-widths less than or equal to 4, partly because representing outliers causes precision issues in parameters that share the same scales as these outliers. This problem is especially pronounced for calibration-free, uniform quantization methods. We introduce SINQ to augment existing post-training quantizers with an additional second-axis scale factor and a fast Sinkhorn-Knopp-style algorithm that finds scales to normalize per-row and per-column variances, thereby minimizing a novel per-matrix proxy target for quantization: the matrix imbalance. Our method has no interactions between layers and can be trivially applied to new architectures to quantize any linear layers. We evaluate our method on the Qwen3 model family and DeepSeek-V2.5. SINQ improves WikiText2 and C4 perplexity significantly against uncalibrated uniform quantization baselines and can be further enhanced by combining it with calibration and non-uniform quantization levels. Code to reproduce the results of this work and to easily quantize models using SINQ is available at https://github.com/huawei-csl/SINQ.
中文: SINQ通过引入第二轴尺度因子和采用Sinkhorn-Knopp算法归一化方差,有效提升了后训练量化在低比特宽度下的性能,显著改善了Qwen3和DeepSeek-V2.5等模型的困惑度,且无需层间交互。
English: SINQ enhances post-training quantization by adding a second-axis scale factor and using a Sinkhorn-Knopp algorithm to normalize variances, significantly improving perplexity in models like Qwen3 and DeepSeek-V2.5 at low bit-widths without layer interactions.

Authors:Sergiu Bursuc, Theodore Ehrenborg, Shaowei Lin, Lacramioara Astefanoaei, Ionel Emilian Chiosa, Jure Kukovec, Alok Singh, Oliver Butterley, Adem Bizid, Quinn Dougherty, Miranda Zhao, Max Tan, Max Tegmark
Title: A benchmark for vericoding: formally verified program synthesis
Abstract:
We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark
中文: 本研究推出了最大的验证编码基准,测试了大型语言模型基于Dafny、Verus/Rust和Lean规范生成形式验证代码的能力,成功率因语言而异,且自然语言描述未显著提升性能。
English: This study introduces the largest benchmark for vericoding, testing LLMs on generating formally verified code from specifications across Dafny, Verus/Rust, and Lean, with success rates varying by language and no significant improvement from natural language descriptions.

Authors:Haochen Gong, Zhen Tao, Shidong Pan, Zhenchang Xing, Xiaoyu Sun
Title: Towards Context-aware Mobile Privacy Notice: Implementation of A Deployable Contextual Privacy Policies Generator
Abstract:
Lengthy and legally phrased privacy policies impede users' understanding of how mobile applications collect and process personal data. Prior work proposed Contextual Privacy Policies (CPPs) for mobile apps to display shorter policy snippets only in the corresponding user interface contexts, but the pipeline could not be deployable in real-world mobile environments. In this paper, we present PrivScan, the first deployable CPP Software Development Kit (SDK) for Android. It captures live app screenshots to identify GUI elements associated with types of personal data and displays CPPs in a concise, user-facing format. We provide a lightweight floating button that offers low-friction, on-demand control. The architecture leverages remote deployment to decouple the multimodal backend pipeline from a mobile client comprising five modular components, thereby reducing on-device resource demands and easing cross-platform portability. A feasibility-oriented evaluation shows an average execution time of 9.15\,s, demonstrating the practicality of our approach. The source code of PrivScan is available at https://github.com/buyanghc/PrivScan and the demo video can be found at https://www.youtube.com/watch?v=ck-25otfyHc.
Chinese: 本文提出PrivScan,首个可部署的Android SDK,通过实时截屏识别与个人数据相关的界面元素,以浮动按钮形式简洁展示情境化隐私政策,9.15秒的平均执行时间验证了方案的可行性。
English: This paper introduces PrivScan, a deployable Android SDK that captures live app screenshots to identify GUI elements linked to personal data and displays concise contextual privacy policies via a floating button, with an average execution time of 9.15 seconds proving its feasibility.

Authors:Federico Chinello, Giacomo Boracchi
Title: Convolutional Set Transformer
Abstract:
We introduce the Convolutional Set Transformer (CST), a novel neural architecture designed to process image sets of arbitrary cardinality that are visually heterogeneous yet share high-level semantics - such as a common category, scene, or concept. Existing set-input networks, e.g., Deep Sets and Set Transformer, are limited to vector inputs and cannot directly handle 3D image tensors. As a result, they must be cascaded with a feature extractor, typically a CNN, which encodes images into embeddings before the set-input network can model inter-image relationships. In contrast, CST operates directly on 3D image tensors, performing feature extraction and contextual modeling simultaneously, thereby enabling synergies between the two processes. This design yields superior performance in tasks such as Set Classification and Set Anomaly Detection and further provides native compatibility with CNN explainability methods such as Grad-CAM, unlike competing approaches that remain opaque. Finally, we show that CSTs can be pre-trained on large-scale datasets and subsequently adapted to new domains and tasks through standard Transfer Learning schemes. To support further research, we release CST-15, a CST backbone pre-trained on ImageNet (https://github.com/chinefed/convolutional-set-transformer).
中文摘要:卷积集合变换器(CST)是一种新型神经网络架构,可直接处理三维图像张量组成的异构图像集,通过同步实现特征提取与上下文建模,在集合分类等任务中性能优于现有方法,并保持与CNN可解释性方法的兼容性。
English Summary: The Convolutional Set Transformer (CST) is a novel neural architecture that directly processes heterogeneous image sets as 3D tensors, integrating feature extraction and contextual modeling to outperform existing methods in tasks like set classification while maintaining compatibility with CNN explainability techniques.

Authors:Komal Kumar, Rao Muhammad Anwer, Fahad Shahbaz Khan, Salman Khan, Ivan Laptev, Hisham Cholakkal
Title: DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
Abstract:
Efficient fine-tuning of pre-trained Text-to-Image (T2I) models involves adjusting the model to suit a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often faces challenges in striking a trade-off between aligning with the target distribution: learning a novel concept from a limited image for personalization and retaining the instruction ability needed for unifying multiple tasks, all while maintaining editability (aligning with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable matrices: (1) a projection onto the complement of a low-rank subspace spanned by a low-rank matrix, and (2) a low-rank update. The single trainable low-rank matrix defines the subspace, while the other trainable low-rank matrix enables flexible parameter adaptation within that subspace. We conducted extensive experiments on the Dreambooth and Dreambench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for a universal image generation framework through visual in-context learning with both Stable Diffusion and a unified model. Our results demonstrated state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available on \href{https://github.com/MAXNORM8650/DEFT}{DEFTBase}.
Chinese: DEFT是一种高效微调框架,通过将权重更新分解为两个低秩组件,在个性化学习、任务统一和可编辑性之间实现最佳平衡,并在多个数据集上取得了最先进的性能。
English: DEFT is an efficient fine-tuning framework that decomposes weight updates into two low-rank components, enabling optimal balance between personalization, task unification, and editability while achieving state-of-the-art performance across multiple datasets.

Authors:Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk, Maxim Minets, Daria Ozerova, Emil Sataev, Denis Zuenko, Andrey E. Ustyuzhanin
Title: ML2B: Multi-Lingual ML Benchmark For AutoML
Abstract:
Large language models (LLMs) have recently demonstrated strong capabilities in generating machine learning (ML) code, enabling end-to-end pipeline construction from natural language instructions. However, existing benchmarks for ML code generation are mainly restricted to English, overlooking the global and multilingual nature of ML research and practice. To address this gap, we present ML2B, the first benchmark for evaluating multilingual ML code generation. ML2B consists of 30 Kaggle competitions translated into 13 natural languages, covering tabular, text, and image data types, with structured metadata and validated human-reviewed translations. For evaluation, we employ AIDE, an automated framework for end-to-end assessment of data science pipelines, and provide insights into cross-lingual model performance. Our results reveal substantial 15-45% performance degradation on non-English tasks, highlighting critical challenges in multilingual representation learning for code generation. The benchmark, evaluation framework, and comprehensive results are made available through our GitHub repository to facilitate future research in multilingual ML code generation: https://github.com/enaix/ml2b.
中文摘要:ML2B作为首个多语言机器学习代码生成基准,通过评估发现非英语任务的性能相比英语任务显著下降15-45%,凸显了多语言代码生成面临的关键挑战。
English Summary: The ML2B benchmark is introduced as the first multilingual evaluation tool for machine learning code generation, revealing significant performance drops of 15-45% on non-English tasks compared to English ones.

Authors:Le Zhang, Ao Li, Qibin Hou, Ce Zhu, Yonina C. Eldar
Title: Deep Learning Empowered Super-Resolution: A Comprehensive Survey and Future Prospects
Abstract:
Super-resolution (SR) has garnered significant attention within the computer vision community, driven by advances in deep learning (DL) techniques and the growing demand for high-quality visual applications. With the expansion of this field, numerous surveys have emerged. Most existing surveys focus on specific domains, lacking a comprehensive overview of this field. Here, we present an in-depth review of diverse SR methods, encompassing single image super-resolution (SISR), video super-resolution (VSR), stereo super-resolution (SSR), and light field super-resolution (LFSR). We extensively cover over 150 SISR methods, nearly 70 VSR approaches, and approximately 30 techniques for SSR and LFSR. We analyze methodologies, datasets, evaluation protocols, empirical results, and complexity. In addition, we conducted a taxonomy based on each backbone structure according to the diverse purposes. We also explore valuable yet under-studied open issues in the field. We believe that this work will serve as a valuable resource and offer guidance to researchers in this domain. To facilitate access to related work, we created a dedicated repository available at https://github.com/AVC2-UESTC/Holistic-Super-Resolution-Review.
中文摘要:本综述全面分析了150多种单图像、70种视频及30种立体/光场超分辨率方法,通过方法论比较和未充分研究方向的探讨,为研究人员提供了该领域的重要参考资源。
English Summary: This comprehensive survey provides an in-depth analysis of over 150 single image, 70 video, and 30 stereo/light field super-resolution methods, offering methodological comparisons and identifying under-studied research directions to serve as a key resource for researchers.

Authors:Ha-Hieu Pham, Minh Le, Han Huynh, Nguyen Quoc Khanh Le, Huy-Hieu Pham
Title: Graph-Theoretic Consistency for Robust and Topology-Aware Semi-Supervised Histopathology Segmentation
Abstract:
Semi-supervised semantic segmentation (SSSS) is vital in computational pathology, where dense annotations are costly and limited. Existing methods often rely on pixel-level consistency, which propagates noisy pseudo-labels and produces fragmented or topologically invalid masks. We propose Topology Graph Consistency (TGC), a framework that integrates graph-theoretic constraints by aligning Laplacian spectra, component counts, and adjacency statistics between prediction graphs and references. This enforces global topology and improves segmentation accuracy. Experiments on GlaS and CRAG demonstrate that TGC achieves state-of-the-art performance under 5-10% supervision and significantly narrows the gap to full supervision. Code is available at https://github.com/hieuphamha19/TGC.
Chinese: 提出的拓扑图一致性(TGC)框架通过施加图论约束来保持全局拓扑结构,从而在半监督语义分割中取得优异表现,在少量监督下达到最先进水平,并显著缩小与全监督之间的差距。
English: The proposed Topology Graph Consistency (TGC) framework enhances semi-supervised semantic segmentation by enforcing graph-theoretic constraints to maintain global topology, achieving state-of-the-art performance with minimal supervision and narrowing the gap to full supervision on medical datasets.

Authors:Yash Thube
Title: Pathological Truth Bias in Vision-Language Models
Abstract:
Vision Language Models (VLMs) are improving quickly, but standard benchmarks can hide systematic failures that reduce real world trust. We introduce MATS (Multimodal Audit for Truthful Spatialization), a compact behavioral audit that measures whether models reject visually contradicted statements, and two metrics Spatial Consistency Score (SCS) and Incorrect Agreement Rate (IAR). Instruction tuned generative VLMs (LLaVA 1.5, QwenVLchat) exhibit very low SCS and high IAR, while contrastive encoders (CLIP, SigLIP) are far more robust. Activation patching causally localizes failure loci (mid to late cross attention for generative models, pooled projection components for contrastive models) and suggests concrete repair paths.
中文: 视觉语言模型在拒绝视觉矛盾陈述方面存在系统性缺陷,MATS审计通过空间一致性评分和不正确同意率量化了生成模型的不足,并借助激活修补定位了可修复的故障节点。
English: Vision Language Models often fail to reject visually contradicted statements, as revealed by the MATS audit, which identifies systematic weaknesses in generative models and suggests targeted repairs through activation patching.

Authors:Long Xing, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin
Title: CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
Abstract:
Image captioning is a fundamental task that bridges the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome the limitation of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective nature of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL significantly enhances multiple settings. Pretraining on the CapRL-5M caption dataset annotated by CapRL-3B results in substantial gains across 12 benchmarks. Moreover, within the Prism Framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B, while exceeding the baseline by an average margin of 8.4%. Code is available here: https://github.com/InternLM/CapRL.
中文总结:该研究提出CapRL强化学习框架,通过评估描述能否帮助语言模型准确回答图像相关问题来定义描述质量,从而突破监督微调的限制。
English Summary: The study introduces CapRL, a reinforcement learning framework that overcomes limitations of supervised fine-tuning by defining caption quality through a caption's ability to help language models answer image-related questions accurately.

Authors:Renjie Luo, Zichen Liu, Xiangyan Liu, Chao Du, Min Lin, Wenhu Chen, Wei Lu, Tianyu Pang
Title: Language Models Can Learn from Verbal Feedback Without Scalar Rewards
Abstract:
LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of their richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to directly learn from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
中文摘要:作者提出了一种反馈条件策略(FCP),将语言反馈作为语言模型的调节信号,通过离线训练和在线自举直接从响应-反馈对中学习,将反馈驱动学习重新定义为条件生成而非奖励优化。
English Summary: The authors propose a feedback-conditional policy (FCP) that treats verbal feedback as a conditioning signal for language models, enabling direct learning from response-feedback pairs through both offline training and online bootstrapping, reframing feedback-driven learning as conditional generation rather than reward optimization.

Authors:Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
Title: Variational Reasoning for Language Models
Abstract:
We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
中文摘要:本文提出了一种变分推理框架,将强化学习方法与变分推断相统一,通过稳定的训练目标提升语言模型推理能力,并揭示了模型对简单问题的内在偏好。
English Summary: This paper presents a variational reasoning framework that unifies variational inference with reinforcement learning methods to enhance language model reasoning through stable training objectives and reveals an inherent bias toward easier questions.

Authors:Alexandre Lopes, Roberto Souza, Helio Pedrini
Title: CCNeXt: An Effective Self-Supervised Stereo Depth Estimation Approach
Abstract:
Depth Estimation plays a crucial role in recent applications in robotics, autonomous vehicles, and augmented reality. These scenarios commonly operate under constraints imposed by computational power. Stereo image pairs offer an effective solution for depth estimation since it only needs to estimate the disparity of pixels in image pairs to determine the depth in a known rectified system. Due to the difficulty in acquiring reliable ground-truth depth data across diverse scenarios, self-supervised techniques emerge as a solution, particularly when large unlabeled datasets are available. We propose a novel self-supervised convolutional approach that outperforms existing state-of-the-art Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) while balancing computational cost. The proposed CCNeXt architecture employs a modern CNN feature extractor with a novel windowed epipolar cross-attention module in the encoder, complemented by a comprehensive redesign of the depth estimation decoder. Our experiments demonstrate that CCNeXt achieves competitive metrics on the KITTI Eigen Split test data while being 10.18$\times$ faster than the current best model and achieves state-of-the-art results in all metrics in the KITTI Eigen Split Improved Ground Truth and Driving Stereo datasets when compared to recently proposed techniques. To ensure complete reproducibility, our project is accessible at \href{https://github.com/alelopes/CCNext}{\texttt{https://github.com/alelopes/CCNext}}.
中文: 提出的CCNeXt架构采用自监督卷积方法,在深度估计任务中超越了现有CNN和ViT模型,以显著提升的计算速度在多个数据集上实现了最优性能。
English: The proposed CCNeXt architecture introduces a self-supervised convolutional approach that surpasses existing CNNs and ViTs in depth estimation, achieving state-of-the-art results with significantly faster computational speed across multiple datasets.

Authors:Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Title: SPARK: Synergistic Policy And Reward Co-Evolving Framework
Abstract:
Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
中文摘要:SPARK框架通过回收利用训练过程中的数据和正确性信号,协同优化策略与生成式奖励模型,无需依赖昂贵的人工反馈,在多项基准测试中实现了显著的性能提升。
English Summary: The SPARK framework efficiently recycles rollout and correctness data to co-evolve both the policy and a generative reward model, eliminating the need for costly human feedback and achieving significant performance gains across various benchmarks.

Authors:Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
Title: LongLive: Real-time Interactive Long Video Generation
Abstract:
We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.
中文: LongLive是一种用于实时生成长视频的自回归框架,通过KV重缓存和流式长视频调优解决了效率与质量难题,实现了高速生成并支持交互式提示输入。
English: LongLive is an autoregressive framework for real-time long video generation that overcomes efficiency and quality challenges through KV-recache and streaming long tuning, achieving high-speed performance and supporting interactive prompt inputs.

Authors:Dmitri Volkov, Yafei Yang, Chung-chieh Shan
Title: Committing to the bit: Relational programming with semiring arrays and SAT solving
Abstract:
We propose semiringKanren, a relational programming language where each relation expression denotes a semiring array. We formalize a type system that restricts the arrays to finite size. We then define a semantics that is parameterized by the semiring that the arrays draw their elements from. We compile semiringKanren types to bitstring representations. For the Boolean semiring, this compilation enables us to use an SAT solver to run semiringKanren programs efficiently. We compare the performance of semiringKanren and faster miniKanren for solving Sudoku puzzles. Our experiment shows that semiringKanren can be a more efficient variant of miniKanren.
Chinese: semiringKanren 是一种关系型编程语言,它采用半环数组并将类型编译为位串,通过布尔半环使用SAT求解器实现高效执行,在解决数独问题时展现出优于miniKanren的性能。
English: semiringKanren is a relational programming language that uses semiring arrays and compiles types to bitstrings, enabling efficient execution with SAT solvers for the Boolean semiring and showing improved performance over miniKanren in Sudoku solving.

Authors:Katsuhiko Hayashi, Hidetaka Kamigaito
Title: From Formal Language Theory to Statistical Learning: Finite Observability of Subregular Languages
Abstract:
We prove that all standard subregular language classes are linearly separable when represented by their deciding predicates. This establishes finite observability and guarantees learnability with simple linear models. Synthetic experiments confirm perfect separability under noise-free conditions, while real-data experiments on English morphology show that learned features align with well-known linguistic constraints. These results demonstrate that the subregular hierarchy provides a rigorous and interpretable foundation for modeling natural language structure. Our code used in real-data experiments is available at https://github.com/UTokyo-HayashiLab/subregular.
中文: 该研究证明所有标准次正则语言类通过其判定谓词均可线性分离,确保了有限可观测性和线性模型的可学习性,实验结果表明在自然语言中实现了完美分离并与语言学约束一致。
English: The study demonstrates that all standard subregular language classes are linearly separable through their deciding predicates, ensuring finite observability and learnability with linear models, with experimental results validating perfect separability and alignment with linguistic constraints in natural language.

Authors:Mo El-Haj
Title: ArabJobs: A Multinational Corpus of Arabic Job Ads
Abstract:
ArabJobs is a publicly available corpus of Arabic job advertisements collected from Egypt, Jordan, Saudi Arabia, and the United Arab Emirates. Comprising over 8,500 postings and more than 550,000 words, the dataset captures linguistic, regional, and socio-economic variation in the Arab labour market. We present analyses of gender representation and occupational structure, and highlight dialectal variation across ads, which offers opportunities for future research. We also demonstrate applications such as salary estimation and job category normalisation using large language models, alongside benchmark tasks for gender bias detection and profession classification. The findings show the utility of ArabJobs for fairness-aware Arabic NLP and labour market research. The dataset is publicly available on GitHub: https://github.com/drelhaj/ArabJobs.
中文摘要:ArabJobs是一个包含四个阿拉伯国家招聘广告的公开语料库,支持通过自然语言处理进行劳动力市场多样性研究及薪资预测等应用。
English Summary: ArabJobs is a comprehensive public dataset of Arabic job ads from four Arab countries, enabling research on labor market variations and applications like salary estimation and bias detection through NLP.

Authors:Guannan Lai, Da-Wei Zhou, Xin Yang, Han-Jia Ye
Title: The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?
Abstract:
Class Incremental Learning (CIL) requires models to continuously learn new classes without forgetting previously learned ones, while maintaining stable performance across all possible class sequences. In real-world settings, the order in which classes arrive is diverse and unpredictable, and model performance can vary substantially across different sequences. Yet mainstream evaluation protocols calculate mean and variance from only a small set of randomly sampled sequences. Our theoretical analysis and empirical results demonstrate that this sampling strategy fails to capture the full performance range, resulting in biased mean estimates and a severe underestimation of the true variance in the performance distribution. We therefore contend that a robust CIL evaluation protocol should accurately characterize and estimate the entire performance distribution. To this end, we introduce the concept of extreme sequences and provide theoretical justification for their crucial role in the reliable evaluation of CIL. Moreover, we observe a consistent positive correlation between inter-task similarity and model performance, a relation that can be leveraged to guide the search for extreme sequences. Building on these insights, we propose EDGE (Extreme case-based Distribution and Generalization Evaluation), an evaluation protocol that adaptively identifies and samples extreme class sequences using inter-task similarity, offering a closer approximation of the ground-truth performance distribution. Extensive experiments demonstrate that EDGE effectively captures performance extremes and yields more accurate estimates of distributional boundaries, providing actionable insights for model selection and robustness checking. Our code is available at https://github.com/AIGNLAI/EDGE.
中文: 类增量学习(CIL)评估通过EDGE协议得到改进,该协议利用任务间相似性识别极端类别序列,以更准确全面地评估性能分布,解决了现有方法低估方差的不足。
English: Class Incremental Learning (CIL) evaluation is enhanced by the EDGE protocol, which uses inter-task similarity to identify extreme class sequences for a more accurate and comprehensive performance distribution assessment, addressing the limitations of current methods that underestimate variance.

Authors:Jinfeng Zhou, Zheyu Chen, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, Minlie Huang
Title: Think Socially via Cognitive Reasoning
Abstract:
LLMs trained for logical reasoning excel at step-by-step deduction to reach verifiable answers. However, this paradigm is ill-suited for navigating social situations, which induce an interpretive process of analyzing ambiguous cues that rarely yield a definitive outcome. To bridge this gap, we introduce Cognitive Reasoning, a paradigm modeled on human social cognition. It formulates the interpretive process into a structured cognitive flow of interconnected cognitive units (e.g., observation or attribution), which combine adaptively to enable effective social thinking and responses. We then propose CogFlow, a complete framework that instills this capability in LLMs. CogFlow first curates a dataset of cognitive flows by simulating the associative and progressive nature of human thought via tree-structured planning. After instilling the basic cognitive reasoning capability via supervised fine-tuning, CogFlow adopts reinforcement learning to enable the model to improve itself via trial and error, guided by a multi-objective reward that optimizes both cognitive flow and response quality. Extensive experiments show that CogFlow effectively enhances the social cognitive capabilities of LLMs, and even humans, leading to more effective social decision-making.
中文:针对逻辑推理训练的大语言模型难以处理模糊的社会情境,因此我们提出了认知推理范式及CogFlow框架,通过结构化认知流程和强化学习增强其社会认知能力,从而提升社会决策效果。
English: Large language models trained for logical reasoning struggle with social situations due to their ambiguity, so we introduce Cognitive Reasoning and CogFlow, a framework that enhances social cognition in LLMs through structured cognitive flows and reinforcement learning, improving social decision-making.

Authors:Zhenqi He, Yuanpei Liu, Kai Han
Title: Category Discovery: An Open-World Perspective
Abstract:
Category discovery (CD) is an emerging open-world learning task, which aims at automatically categorizing unlabelled data containing instances from unseen classes, given some labelled data from seen classes. This task has attracted significant attention over the years and leads to a rich body of literature trying to address the problem from different perspectives. In this survey, we provide a comprehensive review of the literature, and offer detailed analysis and in-depth discussion on different methods. Firstly, we introduce a taxonomy for the literature by considering two base settings, namely novel category discovery (NCD) and generalized category discovery (GCD), and several derived settings that are designed to address the extra challenges in different real-world application scenarios, including continual category discovery, skewed data distribution, federated category discovery, etc. Secondly, for each setting, we offer a detailed analysis of the methods encompassing three fundamental components, representation learning, label assignment, and estimation of class number. Thirdly, we benchmark all the methods and distill key insights showing that large-scale pretrained backbones, hierarchical and auxiliary cues, and curriculum-style training are all beneficial for category discovery, while challenges remain in the design of label assignment, the estimation of class numbers, and scaling to complex multi-object scenarios. Finally, we discuss the key insights from the literature so far and point out promising future research directions. We compile a living survey of the category discovery literature at https://github.com/Visual-AI/Category-Discovery.
中文: 类别发现是一种开放世界学习任务,利用已见类别的标注数据对未见类别的未标注数据进行自动分类,本综述系统梳理了不同设定下的方法体系,通过基准测试指出表征学习和课程式训练等有效策略,同时揭示了类别数量估计等关键挑战与发展方向。
English: Category discovery is an open-world learning task that groups unlabeled data from unseen classes using labeled data from seen classes, with this survey providing a comprehensive taxonomy, method analysis, and benchmarking of approaches while highlighting challenges like class number estimation and future research directions.

Authors:Yonghan Jung
Title: Debiased Front-Door Learners for Heterogeneous Effects
Abstract:
In observational settings where treatment and outcome share unmeasured confounders but an observed mediator remains unconfounded, the front-door (FD) adjustment identifies causal effects through the mediator. We study the heterogeneous treatment effect (HTE) under FD identification and introduce two debiased learners: FD-DR-Learner and FD-R-Learner. Both attain fast, quasi-oracle rates (i.e., performance comparable to an oracle that knows the nuisances) even when nuisance functions converge as slowly as n^-1/4. We provide error analyses establishing debiasedness and demonstrate robust empirical performance in synthetic studies and a real-world case study of primary seat-belt laws using Fatality Analysis Reporting System (FARS) dataset. Together, these results indicate that the proposed learners deliver reliable and sample-efficient HTE estimates in FD scenarios. The implementation is available at https://github.com/yonghanjung/FD-CATE. Keywords: Front-door adjustment; Heterogeneous treatment effects; Debiased learning; Quasi-oracle rates; Causal inference.
中文: 本研究提出了FD-DR-Learner和FD-R-Learner两种去偏学习器,在前门调整下即使存在收敛较慢的干扰函数,也能以准神谕速率快速估计异质处理效应,并在合成与真实数据中验证了其可靠性。
English: The study introduces FD-DR-Learner and FD-R-Learner, two debiased learners that achieve fast, quasi-oracle rates for estimating heterogeneous treatment effects under front-door adjustment, even with slow-converging nuisance functions, and demonstrates their reliability in synthetic and real-world datasets.

Authors:Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Xiaochun Cao
Title: Where MLLMs Attend and What They Rely On: Explaining Autoregressive Token Generation
Abstract:
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in aligning visual inputs with natural language outputs. Yet, the extent to which generated tokens depend on visual modalities remains poorly understood, limiting interpretability and reliability. In this work, we present EAGLE, a lightweight black-box framework for explaining autoregressive token generation in MLLMs. EAGLE attributes any selected tokens to compact perceptual regions while quantifying the relative influence of language priors and perceptual evidence. The framework introduces an objective function that unifies sufficiency (insight score) and indispensability (necessity score), optimized via greedy search over sparsified image regions for faithful and efficient attribution. Beyond spatial attribution, EAGLE performs modality-aware analysis that disentangles what tokens rely on, providing fine-grained interpretability of model decisions. Extensive experiments across open-source MLLMs show that EAGLE consistently outperforms existing methods in faithfulness, localization, and hallucination diagnosis, while requiring substantially less GPU memory. These results highlight its effectiveness and practicality for advancing the interpretability of MLLMs. The code is available at https://github.com/RuoyuChen10/EAGLE.
中文: EAGLE是一个轻量级黑盒框架,通过将多模态大语言模型的令牌生成归因于视觉区域并量化语言先验与感知证据的相对影响,显著提升了模型的可解释性,在忠实度和效率上均优于现有方法。
English: EAGLE is a lightweight black-box framework that enhances the interpretability of multimodal large language models by attributing token generation to visual regions and quantifying the influence of language priors versus perceptual evidence, outperforming existing methods in faithfulness and efficiency.

Authors:Guohui Zhang, Hu Yu, Xiaoxiao Ma, JingHao Zhang, Yaning Pan, Mingde Yao, Jie Xiao, Linjiang Huang, Feng Zhao
Title: Group Critical-token Policy Optimization for Autoregressive Image Generation
Abstract:
Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens for RLVR's training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose $\textbf{G}$roup $\textbf{C}$ritical-token $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{GCPO}$), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: $\textbf{(1)}$ Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; $\textbf{(2)}$ Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridges distinct visual regions; $\textbf{(3)}$ RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30\% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.
Chinese: 近期研究提出关键令牌组策略优化(GCPO),通过基于因果依赖、熵梯度和令牌多样性识别关键图像令牌进行选择性优化,在仅使用30%令牌的情况下实现了比全令牌优化更优的自回归视觉生成效果。
English: Recent research introduces Group Critical-token Policy Optimization (GCPO), which enhances reinforcement learning for autoregressive visual generation by selectively optimizing critical image tokens identified through causal dependency, entropy gradients, and token diversity, achieving superior performance with only 30% of tokens.

Authors:Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, Nezihe Merve Gürel
Title: Evaluating the Limits of Large Language Models in Multilingual Legal Reasoning
Abstract:
In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta's LLaMA, OpenAI's ChatGPT, Google's Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability is also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same task. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.
中文摘要:本研究评估了大型语言模型在多语言法律任务中的表现,发现其在法律推理任务中准确率常低于50%,且存在对抗性攻击漏洞,表明当前模型尚无法可靠应用于高风险的法律领域。
English Summary: This study evaluates the performance of Large Language Models like LLaMA and Gemini on multilingual legal tasks, revealing significant challenges with accuracies often below 50% and persistent vulnerabilities to adversarial attacks, highlighting their current limitations for high-stakes legal applications.

Authors:Alejandro Almodóvar, Patricia A. Apellániz, Santiago Zazo, Juan Parras
Title: CausalKANs: interpretable treatment effect estimation with Kolmogorov-Arnold networks
Abstract:
Deep neural networks achieve state-of-the-art performance in estimating heterogeneous treatment effects, but their opacity limits trust and adoption in sensitive domains such as medicine, economics, and public policy. Building on well-established and high-performing causal neural architectures, we propose causalKANs, a framework that transforms neural estimators of conditional average treatment effects (CATEs) into Kolmogorov--Arnold Networks (KANs). By incorporating pruning and symbolic simplification, causalKANs yields interpretable closed-form formulas while preserving predictive accuracy. Experiments on benchmark datasets demonstrate that causalKANs perform on par with neural baselines in CATE error metrics, and that even simple KAN variants achieve competitive performance, offering a favorable accuracy--interpretability trade-off. By combining reliability with analytic accessibility, causalKANs provide auditable estimators supported by closed-form expressions and interpretable plots, enabling trustworthy individualized decision-making in high-stakes settings. We release the code for reproducibility at https://github.com/aalmodovares/causalkans .
中文:提出的causalKANs框架将神经网络的因果效应估计转化为可解释的科尔莫戈罗夫-阿诺德网络,在保持预测准确性的同时提供透明的闭式公式,为高风险应用中的决策建立可信基础。
English: The proposed causalKANs framework transforms neural treatment effect estimators into interpretable Kolmogorov-Arnold Networks, maintaining predictive accuracy while providing transparent closed-form formulas for trustworthy decision-making in high-stakes applications.

Authors:Hui Li, Changhao Jiang, Hongyu Wang, Ming Zhang, Jiajun Sun, Zhixiong Yang, Yifei Cao, Shihan Dou, Xiaoran Fan, Baoyu Fan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Title: MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark
Abstract:
The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research.Code and benchmark can be found at https://github.com/luckyerr/MDAR.
中文: MDAR基准通过3,000个复杂音频推理任务评估AI模型,发现现有系统在单选项、多选项和开放式问题上均未达到80%准确率,突显了音频推理领域的独特挑战。
English: The MDAR benchmark introduces 3,000 complex audio reasoning tasks to evaluate AI models, revealing limitations in current systems as none achieve 80% accuracy across single-choice, multiple-choice, and open-ended questions.

Authors:Changhun Kim, Timon Conrad, Redwanul Karim, Julian Oelhaf, David Riebesel, Tomás Arias-Vergara, Andreas Maier, Johann Jäger, Siming Bayer
Title: Physics-informed GNN for medium-high voltage AC power flow with edge-aware attention and line search correction operator
Abstract:
Physics-informed graph neural networks (PIGNNs) have emerged as fast AC power-flow solvers that can replace classic Newton--Raphson (NR) solvers, especially when thousands of scenarios must be evaluated. However, current PIGNNs still need accuracy improvements at parity speed; in particular, the physics loss is inoperative at inference, which can deter operational adoption. We address this with PIGNN-Attn-LS, combining an edge-aware attention mechanism that explicitly encodes line physics via per-edge biases, capturing the grid's anisotropy, with a backtracking line-search-based globalized correction operator that restores an operative decrease criterion at inference. Training and testing use a realistic High-/Medium-Voltage scenario generator, with NR used only to construct reference states. On held-out HV cases consisting of 4--32-bus grids, PIGNN-Attn-LS achieves a test RMSE of 0.00033 p.u. in voltage and 0.08$^\circ$ in angle, outperforming the PIGNN-MLP baseline by 99.5\% and 87.1\%, respectively. With streaming micro-batches, it delivers 2--5$\times$ faster batched inference than NR on 4--1024-bus grids.
中文:PIGNN-Attn-LS通过结合边缘感知注意力机制和回溯线性搜索校正,显著提升了物理信息图神经网络的性能,在电压和角度误差上分别比基线降低99.5%和87.1%,推理速度比牛顿-拉弗森法快2-5倍。
English: PIGNN-Attn-LS enhances physics-informed graph neural networks by integrating an edge-aware attention mechanism and a backtracking line-search correction, achieving superior accuracy with a 99.5% reduction in voltage RMSE and 87.1% in angle error, while providing 2-5 times faster inference than Newton-Raphson solvers.

Authors:Mishal Fatima, Shashank Agnihotri, Marius Bock, Kanchana Vaishnavi Gandikota, Kristof Van Laerhoven, Michael Moeller, Margret Keuper
Title: $γ$-Quant: Towards Learnable Quantization for Low-bit Pattern Recognition
Abstract:
Most pattern recognition models are developed on pre-proce\-ssed data. In computer vision, for instance, RGB images processed through image signal processing (ISP) pipelines designed to cater to human perception are the most frequent input to image analysis networks. However, many modern vision tasks operate without a human in the loop, raising the question of whether such pre-processing is optimal for automated analysis. Similarly, human activity recognition (HAR) on body-worn sensor data commonly takes normalized floating-point data arising from a high-bit analog-to-digital converter (ADC) as an input, despite such an approach being highly inefficient in terms of data transmission, significantly affecting the battery life of wearable devices. In this work, we target low-bandwidth and energy-constrained settings where sensors are limited to low-bit-depth capture. We propose $γ$-Quant, i.e.~the task-specific learning of a non-linear quantization for pattern recognition. We exemplify our approach on raw-image object detection as well as HAR of wearable data, and demonstrate that raw data with a learnable quantization using as few as 4-bits can perform on par with the use of raw 12-bit data. All code to reproduce our experiments is publicly available via https://github.com/Mishalfatima/Gamma-Quant
中文摘要:本研究提出γ-Quant方法,通过任务特定的非线性量化学习,仅用4位原始数据即可实现与12位数据相当的模式识别性能,有效解决了计算机视觉和人体活动识别中的能效挑战。
English Summary: This study introduces γ-Quant, a method for learning task-specific non-linear quantization that enables pattern recognition using only 4-bit raw data while matching the performance of 12-bit data, addressing efficiency challenges in computer vision and human activity recognition applications.

Authors:Pei Xu, Zhen Wu, Ruocheng Wang, Vishnu Sarukkai, Kayvon Fatahalian, Ioannis Karamouzas, Victor Zordan, C. Karen Liu
Title: Learning to Ball: Composing Policies for Long-Horizon Basketball Moves
Abstract:
Learning a control policy for a multi-phase, long-horizon task, such as basketball maneuvers, remains challenging for reinforcement learning approaches due to the need for seamless policy composition and transitions between skills. A long-horizon task typically consists of distinct subtasks with well-defined goals, separated by transitional subtasks with unclear goals but critical to the success of the entire task. Existing methods like the mixture of experts and skill chaining struggle with tasks where individual policies do not share significant commonly explored states or lack well-defined initial and terminal states between different phases. In this paper, we introduce a novel policy integration framework to enable the composition of drastically different motor skills in multi-phase long-horizon tasks with ill-defined intermediate states. Based on that, we further introduce a high-level soft router to enable seamless and robust transitions between the subtasks. We evaluate our framework on a set of fundamental basketball skills and challenging transitions. Policies trained by our approach can effectively control the simulated character to interact with the ball and accomplish the long-horizon task specified by real-time user commands, without relying on ball trajectory references.
中文: 本文提出了一种新颖的策略集成框架和高级软路由机制,能够在多阶段长时程任务中实现截然不同运动技能的无缝组合与鲁棒过渡,并成功应用于无需依赖篮球轨迹参考的篮球动作控制。
English: This paper introduces a novel policy integration framework and a high-level soft router to enable seamless composition and robust transitions between drastically different motor skills in multi-phase long-horizon tasks, successfully applied to basketball maneuvers without relying on ball trajectory references.

Authors:Ziheng Chi, Yifan Hou, Chenxi Pang, Shaobo Cui, Mubashara Akhtar, Mrinmaya Sachan
Title: Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding
Abstract:
Diagrams convey symbolic information in a visual format rather than a linear stream of words, making them especially challenging for AI models to process. While recent evaluations suggest that vision-language models (VLMs) perform well on diagram-related benchmarks, their reliance on knowledge, reasoning, or modality shortcuts raises concerns about whether they genuinely understand and reason over diagrams. To address this gap, we introduce Chimera, a comprehensive test suite comprising 7,500 high-quality diagrams sourced from Wikipedia; each diagram is annotated with its symbolic content represented by semantic triples along with multi-level questions designed to assess four fundamental aspects of diagram comprehension: entity recognition, relation understanding, knowledge grounding, and visual reasoning. We use Chimera to measure the presence of three types of shortcuts in visual question answering: (1) the visual-memorization shortcut, where VLMs rely on memorized visual patterns; (2) the knowledge-recall shortcut, where models leverage memorized factual knowledge instead of interpreting the diagram; and (3) the Clever-Hans shortcut, where models exploit superficial language patterns or priors without true comprehension. We evaluate 15 open-source VLMs from 7 model families on Chimera and find that their seemingly strong performance largely stems from shortcut behaviors: visual-memorization shortcuts have slight impact, knowledge-recall shortcuts play a moderate role, and Clever-Hans shortcuts contribute significantly. These findings expose critical limitations in current VLMs and underscore the need for more robust evaluation protocols that benchmark genuine comprehension of complex visual inputs (e.g., diagrams) rather than question-answering shortcuts.
中文: 图表因其符号化视觉特性对AI处理构成独特挑战,尽管视觉语言模型在图表任务上表现良好,但Chimera测试套件揭示其性能主要依赖记忆化、知识召回和语言模式等捷径而非真正理解,暴露出当前模型的根本缺陷。
English: Diagrams pose unique challenges for AI processing due to their symbolic visual nature, and while vision-language models appear competent on diagram tasks, the Chimera test suite reveals their performance heavily relies on shortcuts rather than genuine comprehension, exposing critical limitations in current models.

Authors:Haoyu Li, XiaoSong Li
Title: Gradient-based multi-focus image fusion with focus-aware saliency enhancement
Abstract:
Multi-focus image fusion (MFIF) aims to yield an all-focused image from multiple partially focused inputs, which is crucial in applications cover sur-veillance, microscopy, and computational photography. However, existing methods struggle to preserve sharp focus-defocus boundaries, often resulting in blurred transitions and focused details loss. To solve this problem, we propose a MFIF method based on significant boundary enhancement, which generates high-quality fused boundaries while effectively detecting focus in-formation. Particularly, we propose a gradient-domain-based model that can obtain initial fusion results with complete boundaries and effectively pre-serve the boundary details. Additionally, we introduce Tenengrad gradient detection to extract salient features from both the source images and the ini-tial fused image, generating the corresponding saliency maps. For boundary refinement, we develop a focus metric based on gradient and complementary information, integrating the salient features with the complementary infor-mation across images to emphasize focused regions and produce a high-quality initial decision result. Extensive experiments on four public datasets demonstrate that our method consistently outperforms 12 state-of-the-art methods in both subjective and objective evaluations. We have realized codes in https://github.com/Lihyua/GICI
中文摘要:本文提出了一种基于显著边界增强的多焦点图像融合方法,通过梯度域模型和Tenengrad梯度检测优化边界细节,在四个公开数据集上的实验表明该方法在主观和客观评估中均优于12种先进方法。
English Summary: This paper introduces a multi-focus image fusion method that enhances boundary quality through gradient-domain modeling and Tenengrad detection, demonstrating superior performance over existing methods in preserving sharp focus transitions.

Authors:Nikita Kotelevskii, Maiya Goloburda, Vladimir Kondratyev, Alexander Fishkov, Mohsen Guizani, Eric Moulines, Maxim Panov
Title: Multidimensional Uncertainty Quantification via Optimal Transport
Abstract:
Most uncertainty quantification (UQ) approaches provide a single scalar value as a measure of model reliability. However, different uncertainty measures could provide complementary information on the prediction confidence. Even measures targeting the same type of uncertainty (e.g., ensemble-based and density-based measures of epistemic uncertainty) may capture different failure modes. We take a multidimensional view on UQ by stacking complementary UQ measures into a vector. Such vectors are assigned with Monge-Kantorovich ranks produced by an optimal-transport-based ordering method. The prediction is then deemed more uncertain than the other if it has a higher rank. The resulting VecUQ-OT algorithm uses entropy-regularized optimal transport. The transport map is learned on vectors of scores from in-distribution data and, by design, applies to unseen inputs, including out-of-distribution cases, without retraining. Our framework supports flexible non-additive uncertainty fusion (including aleatoric and epistemic components). It yields a robust ordering for downstream tasks such as selective prediction, misclassification detection, out-of-distribution detection, and selective generation. Across synthetic, image, and text data, VecUQ-OT shows high efficiency even when individual measures fail. The code for the method is available at: https://github.com/stat-ml/multidimensional_uncertainty.
Chinese: VecUQ-OT框架通过将互补的不确定性度量组合成向量并采用最优传输方法进行排序,提出了一种多维不确定性量化方法,无需重新训练即可为多种下游任务提供稳健的不确定性排序。
English: The VecUQ-OT framework introduces a multidimensional approach to uncertainty quantification by combining complementary measures into vectors and ranking them using optimal transport, enabling robust uncertainty ordering for various downstream tasks without requiring retraining.

Authors:Zijian Zhao, Dian Jin, Zijing Zhou
Title: Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach
Abstract:
Recently, Image-to-Music (I2M) generation has garnered significant attention, with potential applications in fields such as gaming, advertising, and multi-modal art creation. However, due to the ambiguous and subjective nature of I2M tasks, most end-to-end methods lack interpretability, leaving users puzzled about the generation results. Even methods based on emotion mapping face controversy, as emotion represents only a singular aspect of art. Additionally, most learning-based methods require substantial computational resources and large datasets for training, hindering accessibility for common users. To address these challenges, we propose the first Vision Language Model (VLM)-based I2M framework that offers high interpretability and low computational cost. Specifically, we utilize ABC notation to bridge the text and music modalities, enabling the VLM to generate music using natural language. We then apply multi-modal Retrieval-Augmented Generation (RAG) and self-refinement techniques to allow the VLM to produce high-quality music without external training. Furthermore, we leverage the generated motivations in text and the attention maps from the VLM to provide explanations for the generated results in both text and image modalities. To validate our method, we conduct both human studies and machine evaluations, where our method outperforms others in terms of music quality and music-image consistency, indicating promising results. Our code is available at https://github.com/RS2002/Image2Music .
中文:该研究提出首个基于视觉语言模型的图像到音乐生成框架,通过ABC记谱法和多模态检索增强技术实现无需外部训练的高质量音乐生成,并利用文本动机和注意力图谱提供双模态解释,在评估中展现出优越的音乐质量与图文一致性。
English: The proposed Vision Language Model-based Image-to-Music framework overcomes interpretability and computational barriers by using ABC notation and multi-modal techniques to generate high-quality music with dual-modality explanations, achieving superior results in evaluations.

Authors:Niharika Hegde, Subarnaduti Paul, Lars Joel-Frey, Manuel Brack, Kristian Kersting, Martin Mundt, Patrick Schramowski
Title: CHRONOBERG: Capturing Language Evolution and Temporal Awareness in Foundation Models
Abstract:
Large language models (LLMs) excel at operating at scale by leveraging social media and various data crawled from the web. Whereas existing corpora are diverse, their frequent lack of long-term temporal structure may however limit an LLM's ability to contextualize semantic and normative evolution of language and to capture diachronic variation. To support analysis and training for the latter, we introduce CHRONOBERG, a temporally structured corpus of English book texts spanning 250 years, curated from Project Gutenberg and enriched with a variety of temporal annotations. First, the edited nature of books enables us to quantify lexical semantic change through time-sensitive Valence-Arousal-Dominance (VAD) analysis and to construct historically calibrated affective lexicons to support temporally grounded interpretation. With the lexicons at hand, we demonstrate a need for modern LLM-based tools to better situate their detection of discriminatory language and contextualization of sentiment across various time-periods. In fact, we show how language models trained sequentially on CHRONOBERG struggle to encode diachronic shifts in meaning, emphasizing the need for temporally aware training and evaluation pipelines, and positioning CHRONOBERG as a scalable resource for the study of linguistic change and temporal generalization. Disclaimer: This paper includes language and display of samples that could be offensive to readers. Open Access: Chronoberg is available publicly on HuggingFace at ( https://huggingface.co/datasets/spaul25/Chronoberg). Code is available at (https://github.com/paulsubarna/Chronoberg).
Chinese: CHRONOBERG是一个跨越250年的英语书籍时间标注语料库,旨在帮助大型语言模型更好地捕捉语言演变和历时意义变化,弥补现有训练数据的不足。
English: CHRONOBERG is a temporally annotated corpus of English books spanning 250 years, designed to help large language models better capture language evolution and diachronic meaning shifts, addressing limitations in current training data.

Authors:Xiao Wang, Shujuan Wu, Xiaoxia Cheng, Changwei Bi, Jin Tang, Bin Luo
Title: Pedestrian Attribute Recognition via Hierarchical Cross-Modality HyperGraph Learning
Abstract:
Current Pedestrian Attribute Recognition (PAR) algorithms typically focus on mapping visual features to semantic labels or attempt to enhance learning by fusing visual and attribute information. However, these methods fail to fully exploit attribute knowledge and contextual information for more accurate recognition. Although recent works have started to consider using attribute text as additional input to enhance the association between visual and semantic information, these methods are still in their infancy. To address the above challenges, this paper proposes the construction of a multi-modal knowledge graph, which is utilized to mine the relationships between local visual features and text, as well as the relationships between attributes and extensive visual context samples. Specifically, we propose an effective multi-modal knowledge graph construction method that fully considers the relationships among attributes and the relationships between attributes and vision tokens. To effectively model these relationships, this paper introduces a knowledge graph-guided cross-modal hypergraph learning framework to enhance the standard pedestrian attribute recognition framework. Comprehensive experiments on multiple PAR benchmark datasets have thoroughly demonstrated the effectiveness of our proposed knowledge graph for the PAR task, establishing a strong foundation for knowledge-guided pedestrian attribute recognition. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
中文摘要:本文提出了一种多模态知识图谱,通过建模视觉特征与属性文本之间的关系来提升行人属性识别性能,并在多个基准数据集上通过全面实验验证了其有效性。
English Summary: This paper introduces a multi-modal knowledge graph to enhance pedestrian attribute recognition by modeling relationships between visual features and attribute texts, validated through comprehensive experiments on benchmark datasets.

Authors:Pierrick Chatillon, Julien Rabin, David Tschumperlé
Title: NIFTY: a Non-Local Image Flow Matching for Texture Synthesis
Abstract:
This paper addresses the problem of exemplar-based texture synthesis. We introduce NIFTY, a hybrid framework that combines recent insights on diffusion models trained with convolutional neural networks, and classical patch-based texture optimization techniques. NIFTY is a non-parametric flow-matching model built on non-local patch matching, which avoids the need for neural network training while alleviating common shortcomings of patch-based methods, such as poor initialization or visual artifacts. Experimental results demonstrate the effectiveness of the proposed approach compared to representative methods from the literature. Code is available at https://github.com/PierrickCh/Nifty.git
中文: 本文提出NIFTY混合框架,结合扩散模型与基于斑块的纹理优化技术,无需神经网络训练即可解决基于范例的纹理合成问题,并有效克服传统斑块方法的常见缺陷。
English: This paper introduces NIFTY, a hybrid framework for exemplar-based texture synthesis that combines diffusion models with patch-based optimization, eliminating neural network training while overcoming common limitations of patch methods.

Authors:Jinpeng Lu, Linghan Cai, Yinda Chen, Guo Tang, Songhan Jiang, Haoyuan Shi, Zhiwei Xiong
Title: Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
Abstract:
Lightweight 3D medical image segmentation remains constrained by a fundamental "efficiency / robustness conflict", particularly when processing complex anatomical structures and heterogeneous modalities. In this paper, we study how to redesign the framework based on the characteristics of high-dimensional 3D images, and explore data synergy to overcome the fragile representation of lightweight methods. Our approach, VeloxSeg, begins with a deployable and extensible dual-stream CNN-Transformer architecture composed of Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC). For each 3D image, we invoke a "glance-and-focus" principle, where PWA rapidly retrieves multi-scale information, and JLC ensures robust local feature extraction with minimal parameters, significantly enhancing the model's ability to operate with low computational budget. Followed by an extension of the dual-stream architecture that incorporates modal interaction into the multi-scale image-retrieval process, VeloxSeg efficiently models heterogeneous modalities. Finally, Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior extracted by a self-supervised network into the segmentation network, yielding stronger representations than baselines at no extra inference cost. Experimental results on multimodal benchmarks show that VeloxSeg achieves a 26% Dice improvement, alongside increasing GPU throughput by 11x and CPU by 48x. Codes are available at https://github.com/JinPLu/VeloxSeg.
Chinese: VeloxSeg提出了一种双流CNN-Transformer架构,结合配对窗口注意力和约翰逊-林登斯特劳斯引理指导的卷积,通过模态交互和多尺度特征提取,在提升轻量化3D医学图像分割效率与鲁棒性的同时,实现了显著的性能改进和计算加速。
English: VeloxSeg introduces a dual-stream CNN-Transformer architecture with Paired Window Attention and Johnson-Lindenstrauss lemma-guided convolution, enhancing lightweight 3D medical image segmentation by improving efficiency and robustness across heterogeneous modalities while achieving significant performance gains and computational speedups.

Authors:Ke Li, Zheng Yang, Zhongbin Zhou, Feng Xue, Zhonglin Jiang, Wenxiao Wang
Title: HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
Abstract:
Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where d is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including DeepSeek MoE and Qwen MoE family, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20% ~ 25% in most models, while also reducing FLOPs nearly by 20%. The code can be found at \href{https://github.com/LLIKKE/HEAPr}{https://github.com/LLIKKE/HEAPr}.
中文: HEAPr提出了一种新颖的原子专家剪枝方法,通过简化二阶信息计算,在保持20-25%压缩比下实现近乎无损的模型压缩,同时降低计算成本,性能优于现有专家级剪枝方法。
English: HEAPr introduces a novel atomic expert pruning method for Mixture-of-Experts models that leverages simplified second-order information to achieve nearly lossless compression at 20-25% ratios while reducing computational costs, outperforming existing expert-level pruning techniques.

Authors:Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi
Title: Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models
Abstract:
Modern state-space models (SSMs) often utilize transition matrices which enable efficient computation but pose restrictions on the model's expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). Consequently, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N \times N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multiclass time-series classification, the performance is comparable to that of neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded as a set of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model
中文: PD-SSM方法通过结构化稀疏参数化实现了最优有限状态自动机模拟,在保持线性计算复杂度的同时,在状态追踪任务上显著优于现有状态空间模型变体。
English: The proposed PD-SSM method introduces a structured sparse parametrization for state-space models that achieves optimal finite-state automata emulation with linear computational scaling while significantly outperforming existing SSM variants on state tracking tasks.

Authors:Michael Jungo, Andreas Fischer
Title: Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models
Abstract:
Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 has demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emerging properties of reinforcement learning, particularly the enhanced reason capabilities. We study the effects of rule-based reinforcement learning with the task of Document Image Classification which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distritbution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.
中文摘要:基于规则的强化学习在文档图像分类任务中展现出更强的泛化能力,尤其在处理分布外图像、未见类别和不同模态数据时表现优异。
English Summary: Rule-based reinforcement learning shows improved generalization for document image classification, particularly with out-of-distribution data across images, classes, and modalities.

Authors:Yifang Zhang, Pengfei Duan, Yiwen Yang, Shengwu Xiong
Title: Beyond Textual Context: Structural Graph Encoding with Adaptive Space Alignment to alleviate the hallucination of LLMs
Abstract:
Currently, the main approach for Large Language Models (LLMs) to tackle the hallucination issue is incorporating Knowledge Graphs(KGs).However, LLMs typically treat KGs as plain text, extracting only semantic information and limiting their use of the crucial structural aspects of KGs. Another challenge is the gap between the embedding spaces of KGs encoders and LLMs text embeddings, which hinders the effective integration of structured knowledge. To overcome these obstacles, we put forward the SSKG-LLM, an innovative model architecture that is designed to efficiently integrate both the Structural and Semantic information of KGs into the reasoning processes of LLMs. SSKG-LLM incorporates the Knowledge Graph Retrieval (KGR) module and the Knowledge Graph Encoding (KGE) module to preserve semantics while utilizing structure. Then, the Knowledge Graph Adaptation (KGA) module is incorporated to enable LLMs to understand KGs embeddings. We conduct extensive experiments and provide a detailed analysis to explore how incorporating the structural information of KGs can enhance the factual reasoning abilities of LLMs. Our code are available at https://github.com/yfangZhang/SSKG-LLM.
中文: SSKG-LLM模型通过知识图谱检索、编码和适配模块,将知识图谱的结构与语义信息融入大语言模型的推理过程,有效提升了事实推理能力并缓解了幻觉问题。
English: The SSKG-LLM model is introduced to address LLM hallucinations by integrating both structural and semantic information from knowledge graphs through specialized modules, enhancing factual reasoning capabilities.

Authors:Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing
Abstract:
Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150$\times$ speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.
Chinese: FlashEdit是一种新颖的图像编辑框架,通过三项关键创新——OSIE绕过迭代过程、BG-Shield保护背景和SSCA实现精准局部编辑,实现了高保真度的实时编辑,速度提升150倍,编辑时间低于0.2秒。
English: FlashEdit is a novel framework that enables high-fidelity, real-time image editing by incorporating three key innovations—OSIE for bypassing iterative processes, BG-Shield for background preservation, and SSCA for precise localized edits—achieving a 150× speedup and edits in under 0.2 seconds.

Authors:Junbo Niu, Zheng Liu, Zhuangcheng Gu, Bin Wang, Linke Ouyang, Zhiyuan Zhao, Tao Chu, Tianyao He, Fan Wu, Qintong Zhang, Zhenjiang Jin, Guang Liang, Rui Zhang, Wenzheng Zhang, Yuan Qu, Zhifei Ren, Yuefeng Sun, Yuanhong Zheng, Dongsheng Ma, Zirui Tang, Boyu Niu, Ziyang Miao, Hejun Dong, Siyi Qian, Junyuan Zhang, Jingzhou Chen, Fangdong Wang, Xiaomeng Zhao, Liqun Wei, Wei Li, Shasha Wang, Ruiliang Xu, Yuanyuan Cao, Lu Chen, Qianqian Wu, Huaiyu Gu, Lindong Lu, Keming Wang, Dechen Lin, Guanlin Shen, Xuanhe Zhou, Linfeng Zhang, Yuhang Zang, Xiaoyi Dong, Jiaqi Wang, Bo Zhang, Lei Bai, Pei Chu, Weijia Li, Jiang Wu, Lijun Wu, Zhenxiang Li, Guangyu Wang, Zhongying Tu, Chao Xu, Kai Chen, Yu Qiao, Bowen Zhou, Dahua Lin, Wentao Zhang, Conghui He
Title: MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Abstract:
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
Chinese: MinerU2.5是一个12亿参数的视觉语言模型,采用从粗到精的两阶段解析策略,通过将全局布局分析与局部内容识别分离,在保持高计算效率的同时实现了最先进的文档解析性能。
English: MinerU2.5 is a 1.2B-parameter vision-language model that uses a two-stage parsing strategy to achieve state-of-the-art document recognition accuracy with high computational efficiency by decoupling layout analysis from content recognition.

Authors:Jianzhi Yan, Le Liu, Youcheng Pan, Shiwei Chen, Zike Yuan, Yang Xiang, Buzhou Tang
Title: From Long to Lean: Performance-aware and Adaptive Chain-of-Thought Compression via Multi-round Refinement
Abstract:
Chain-of-Thought (CoT) reasoning improves performance on complex tasks but introduces significant inference latency due to verbosity. We propose Multiround Adaptive Chain-of-Thought Compression (MACC), a framework that leverages the token elasticity phenomenon--where overly small token budgets can paradoxically increase output length--to progressively compress CoTs via multiround refinement. This adaptive strategy allows MACC to determine the optimal compression depth for each input. Our method achieves an average accuracy improvement of 5.6 percent over state-of-the-art baselines, while also reducing CoT length by an average of 47 tokens and significantly lowering latency. Furthermore, we show that test-time performance--accuracy and token length--can be reliably predicted using interpretable features like perplexity and compression rate on the training set. Evaluated across different models, our method enables efficient model selection and forecasting without repeated fine-tuning, demonstrating that CoT compression is both effective and predictable. Our code will be released in https://github.com/Leon221220/MACC.
中文:提出的MACC框架通过多轮优化自适应压缩思维链推理,在实现更高准确率和更低延迟的同时,利用可解释特征使性能变得可预测。
English: The proposed MACC framework adaptively compresses Chain-of-Thought reasoning through multiround refinement, achieving higher accuracy with shorter outputs and reduced latency while enabling predictable performance through interpretable features.

Authors:Primakov Chungkham, V Venktesh, Vinay Setty, Avishek Anand
Title: Think Right, Not More: Test-Time Scaling for Numerical Claim Verification
Abstract:
Fact-checking real-world claims, particularly numerical claims, is inherently complex that require multistep reasoning and numerical reasoning for verifying diverse aspects of the claim. Although large language models (LLMs) including reasoning models have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand nuance of numerical aspects, and are also susceptible to the reasoning drift issue, where the model is unable to contextualize diverse information resulting in misinterpretation and backtracking of reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on the task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VERIFIERFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the reasoning drift issue, leading to significant performance gains for fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS, while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://github.com/VenkteshV/VerifierFC
中文: 本研究通过测试时扩展方法和验证器模型,有效缓解大型语言模型在数值声明事实核查中的推理漂移问题,并利用自适应计算显著提升了效率与性能。
English: This research introduces a test-time scaling method with a verifier model to enhance large language models' fact-checking of numerical claims by mitigating reasoning drift and improving efficiency through adaptive computation.

Authors:Inzamamul Alam, Md Tanvir Islam, Simon S. Woo
Title: SpecXNet: A Dual-Domain Convolutional Network for Robust Deepfake Detection
Abstract:
The increasing realism of content generated by GANs and diffusion models has made deepfake detection significantly more challenging. Existing approaches often focus solely on spatial or frequency-domain features, limiting their generalization to unseen manipulations. We propose the Spectral Cross-Attentional Network (SpecXNet), a dual-domain architecture for robust deepfake detection. The core \textbf{Dual-Domain Feature Coupler (DDFC)} decomposes features into a local spatial branch for capturing texture-level anomalies and a global spectral branch that employs Fast Fourier Transform to model periodic inconsistencies. This dual-domain formulation allows SpecXNet to jointly exploit localized detail and global structural coherence, which are critical for distinguishing authentic from manipulated images. We also introduce the \textbf{Dual Fourier Attention (DFA)} module, which dynamically fuses spatial and spectral features in a content-aware manner. Built atop a modified XceptionNet backbone, we embed the DDFC and DFA modules within a separable convolution block. Extensive experiments on multiple deepfake benchmarks show that SpecXNet achieves state-of-the-art accuracy, particularly under cross-dataset and unseen manipulation scenarios, while maintaining real-time feasibility. Our results highlight the effectiveness of unified spatial-spectral learning for robust and generalizable deepfake detection. To ensure reproducibility, we released the full code on \href{https://github.com/inzamamulDU/SpecXNet}{\textcolor{blue}{\textbf{GitHub}}}.
中文摘要:SpecXNet提出了一种结合空间与频谱特征的双域架构,通过创新模块实现了最先进的深度伪造检测,具有强大的泛化能力和实时性能。
English Summary: SpecXNet introduces a dual-domain architecture combining spatial and spectral features through novel modules, achieving state-of-the-art deepfake detection with strong generalization and real-time performance.

Authors:Yudong Li, Yufei Sun, Yuhan Yao, Peiru Yang, Wanyue Li, Jiajun Zou, Yongfeng Huang, Linlin Shen
Title: RedNote-Vibe: A Dataset for Capturing Temporal Dynamics of AI-Generated Text in Social Media
Abstract:
The proliferation of Large Language Models (LLMs) has led to widespread AI-Generated Text (AIGT) on social media platforms, creating unique challenges where content dynamics are driven by user engagement and evolve over time. However, existing datasets mainly depict static AIGT detection. In this work, we introduce RedNote-Vibe, the first longitudinal (5-years) dataset for social media AIGT analysis. This dataset is sourced from Xiaohongshu platform, containing user engagement metrics (e.g., likes, comments) and timestamps spanning from the pre-LLM period to July 2025, which enables research into the temporal dynamics and user interaction patterns of AIGT. Furthermore, to detect AIGT in the context of social media, we propose PsychoLinguistic AIGT Detection Framework (PLAD), an interpretable approach that leverages psycholinguistic features. Our experiments show that PLAD achieves superior detection performance and provides insights into the signatures distinguishing human and AI-generated content. More importantly, it reveals the complex relationship between these linguistic features and social media engagement. The dataset is available at https://github.com/testuser03158/RedNote-Vibe.
中文: 本研究推出了首个社交媒体AI生成文本的纵向数据集RedNote-Vibe,并提出了基于心理语言学特征的可解释检测框架PLAD,该框架不仅能有效识别生成内容,还揭示了语言特征与用户参与度之间的复杂关联。
English: This study introduces RedNote-Vibe, the first longitudinal dataset for analyzing AI-generated text on social media, and proposes PLAD, an interpretable detection framework using psycholinguistic features that reveals connections between linguistic patterns and user engagement.

Authors:Muxi Chen, Zhaohua Zhang, Chenchen Zhao, Mingyang Chen, Wenyu Jiang, Tianwen Jiang, Jianhuan Zhuo, Yu Tang, Qiuyong Xiao, Jihong Zhang, Qiang Xu
Title: FailureAtlas:Mapping the Failure Landscape of T2I Models via Active Exploration
Abstract:
Static benchmarks have provided a valuable foundation for comparing Text-to-Image (T2I) models. However, their passive design offers limited diagnostic power, struggling to uncover the full landscape of systematic failures or isolate their root causes. We argue for a complementary paradigm: active exploration. We introduce FailureAtlas, the first framework designed to autonomously explore and map the vast failure landscape of T2I models at scale. FailureAtlas frames error discovery as a structured search for minimal, failure-inducing concepts. While it is a computationally explosive problem, we make it tractable with novel acceleration techniques. When applied to Stable Diffusion models, our method uncovers hundreds of thousands of previously unknown error slices (over 247,000 in SD1.5 alone) and provides the first large-scale evidence linking these failures to data scarcity in the training set. By providing a principled and scalable engine for deep model auditing, FailureAtlas establishes a new, diagnostic-first methodology to guide the development of more robust generative AI. The code is available at https://github.com/cure-lab/FailureAtlas
中文摘要:FailureAtlas提出了首个主动探索框架,能够自主绘制文本到图像模型的系统性故障图谱,在Stable Diffusion中发现超24.7万个错误案例并揭示其与训练数据匮乏的关联。
English Summary: FailureAtlas introduces an active exploration framework that autonomously maps systematic failures in text-to-image models, revealing over 247,000 error cases in Stable Diffusion and linking them to training data scarcity.

Authors:Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim, Jaeyeon Kim, Tae-Ho Kim, Bo-Kyeong Kim
Title: ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
Abstract:
Efficient processing of high-resolution images is crucial for real-world vision-language applications. However, existing Large Vision-Language Models (LVLMs) incur substantial computational overhead due to the large number of vision tokens. With the advent of "thinking with images" models, reasoning now extends beyond text to the visual domain. This capability motivates our two-stage "coarse-to-fine" reasoning pipeline: first, a downsampled image is analyzed to identify task-relevant regions; then, only these regions are cropped at full resolution and processed in a subsequent reasoning stage. This approach reduces computational cost while preserving fine-grained visual details where necessary. A major challenge lies in inferring which regions are truly relevant to a given query. Recent related methods often fail in the first stage after input-image downsampling, due to perception-driven reasoning, where clear visual information is required for effective reasoning. To address this issue, we propose ERGO (Efficient Reasoning & Guided Observation) that performs reasoning-driven perception-leveraging multimodal context to determine where to focus. Our model can account for perceptual uncertainty, expanding the cropped region to cover visually ambiguous areas for answering questions. To this end, we develop simple yet effective reward components in a reinforcement learning framework for coarse-to-fine perception. Across multiple datasets, our approach delivers higher accuracy than the original model and competitive methods, with greater efficiency. For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup. The code and models can be found at: https://github.com/nota-github/ERGO.
中文: ERGO采用两阶段推理流程,先识别下采样图像中的任务相关区域,再仅对这些区域进行全分辨率处理,从而以显著降低的计算成本实现更高的准确率。
English: ERGO introduces a two-stage reasoning pipeline that first identifies task-relevant regions in downsampled images and then processes only those areas at full resolution, achieving higher accuracy with significantly reduced computational costs.

Authors:Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Title: Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have achieved exceptional performance across diverse tasks, continually surpassing previous expectations regarding their capabilities. Nevertheless, their proficiency in perceiving emotions from images remains debated, with studies yielding divergent results in zero-shot scenarios. We argue that this inconsistency stems partly from constraints in existing evaluation methods, including the oversight of plausible responses, limited emotional taxonomies, neglect of contextual factors, and labor-intensive annotations. To facilitate customized visual emotion evaluation for MLLMs, we propose an Emotion Statement Judgment task that overcomes these constraints. Complementing this task, we devise an automated pipeline that efficiently constructs emotion-centric statements with minimal human effort. Through systematically evaluating prevailing MLLMs, our study showcases their stronger performance in emotion interpretation and context-based emotion judgment, while revealing relative limitations in comprehending perception subjectivity. When compared to humans, even top-performing MLLMs like GPT4o demonstrate remarkable performance gaps, underscoring key areas for future improvement. By developing a fundamental evaluation framework and conducting a comprehensive MLLM assessment, we hope this work contributes to advancing emotional intelligence in MLLMs. Project page: https://github.com/wdqqdw/MVEI.
中文: 本研究提出情感陈述判断任务和自动化流程,以解决多模态大语言模型在视觉情感感知评估中的局限,发现其在情感解读方面表现较强,但与人类能力仍存在显著差距。
English: This study introduces an Emotion Statement Judgment task and automated pipeline to address limitations in evaluating Multimodal Large Language Models' (MLLMs) visual emotion perception, revealing their strengths in emotion interpretation but significant gaps compared to human performance.

Authors:Taeyoung Yun, Pierre-Luc St-Charles, Jinkyoo Park, Yoshua Bengio, Minsu Kim
Title: Active Attacks: Red-teaming LLMs via Adaptive Environments
Abstract:
We address the challenge of generating diverse attack prompts for large language models (LLMs) that elicit harmful behaviors (e.g., insults, sexual content) and are used for safety fine-tuning. Rather than relying on manual prompt engineering, attacker LLMs can be trained with reinforcement learning (RL) to automatically generate such prompts using only a toxicity classifier as a reward. However, capturing a wide range of harmful behaviors is a significant challenge that requires explicit diversity objectives. Existing diversity-seeking RL methods often collapse to limited modes: once high-reward prompts are found, exploration of new regions is discouraged. Inspired by the active learning paradigm that encourages adaptive exploration, we introduce \textit{Active Attacks}, a novel RL-based red-teaming algorithm that adapts its attacks as the victim evolves. By periodically safety fine-tuning the victim LLM with collected attack prompts, rewards in exploited regions diminish, which forces the attacker to seek unexplored vulnerabilities. This process naturally induces an easy-to-hard exploration curriculum, where the attacker progresses beyond easy modes toward increasingly difficult ones. As a result, Active Attacks uncovers a wide range of local attack modes step by step, and their combination achieves wide coverage of the multi-mode distribution. Active Attacks, a simple plug-and-play module that seamlessly integrates into existing RL objectives, unexpectedly outperformed prior RL-based methods -- including GFlowNets, PPO, and REINFORCE -- by improving cross-attack success rates against GFlowNets, the previous state-of-the-art, from 0.07% to 31.28% (a relative gain greater than $400\ \times$) with only a 6% increase in computation. Our code is publicly available \href{https://github.com/dbsxodud-11/active_attacks}{here}.
中文: 本文提出Active Attacks算法,通过周期性安全微调受害者模型来迫使攻击者探索新漏洞,从而自适应生成多样化有害提示,相比之前方法将攻击成功率提升了400倍。
English: This paper introduces Active Attacks, a reinforcement learning-based red-teaming algorithm that adaptively generates diverse harmful prompts by periodically fine-tuning the victim model, forcing the attacker to explore new vulnerabilities and achieving a 400-fold improvement in attack success rates over previous methods.

Authors:Zhengyan Wan, Yidong Ouyang, Liyan Xie, Fang Fang, Hongyuan Zha, Guang Cheng
Title: Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching
Abstract:
Guidance provides a simple and effective framework for posterior sampling by steering the generation process towards the desired distribution. When modeling discrete data, existing approaches mostly focus on guidance with the first-order Taylor approximation to improve the sampling efficiency. However, such an approximation is inappropriate in discrete state spaces since the approximation error could be large. A novel guidance framework for discrete data is proposed to address this problem: We derive the exact transition rate for the desired distribution given a learned discrete flow matching model, leading to guidance that only requires a single forward pass in each sampling step, significantly improving efficiency. This unified novel framework is general enough, encompassing existing guidance methods as special cases, and it can also be seamlessly applied to the masked diffusion model. We demonstrate the effectiveness of our proposed guidance on energy-guided simulations and preference alignment on text-to-image generation and multimodal understanding tasks. The code is available through https://github.com/WanZhengyan/Discrete-Guidance-Matching/tree/main.
中文: 针对离散数据提出的新型引导框架通过推导精确转移率实现高效单步采样,统一了现有方法并在文本到图像生成等任务中验证了有效性。
English: The proposed novel guidance framework for discrete data derives the exact transition rate for posterior sampling, enabling efficient single-pass generation and unifying existing methods while demonstrating effectiveness in tasks like text-to-image generation.

Authors:Yifei Peng, Yaoli Liu, Enbo Xia, Yu Jin, Wang-Zhou Dai, Zhong Ren, Yao-Xiang Ding, Kun Zhou
Title: Abductive Logical Rule Induction by Bridging Inductive Logic Programming and Multimodal Large Language Models
Abstract:
We propose ILP-CoT, a method that bridges Inductive Logic Programming (ILP) and Multimodal Large Language Models (MLLMs) for abductive logical rule induction. The task involves both discovering logical facts and inducing logical rules from a small number of unstructured textual or visual inputs, which still remain challenging when solely relying on ILP, due to the requirement of specified background knowledge and high computational cost, or MLLMs, due to the appearance of perceptual hallucinations. Based on the key observation that MLLMs could propose structure-correct rules even under hallucinations, our approach automatically builds ILP tasks with pruned search spaces based on the rule structure proposals from MLLMs, and utilizes ILP system to output rules built upon rectified logical facts and formal inductive reasoning. Its effectiveness is verified through challenging logical induction benchmarks, as well as a potential application of our approach, namely text-to-image customized generation with rule induction. Our code and data are released at https://github.com/future-item/ILP-CoT.
中文:ILP-CoT方法将归纳逻辑编程与多模态大语言模型相结合,通过利用大语言模型的结构化建议来优化归纳逻辑编程任务并减少感知误差,其有效性已在逻辑归纳基准测试和文本到图像生成应用中得以验证。
English: ILP-CoT integrates Inductive Logic Programming with Multimodal Large Language Models to enhance logical rule induction by leveraging MLLMs' structural proposals to streamline ILP tasks and mitigate perceptual errors, validated through benchmarks and text-to-image generation applications.

Authors:Junhao Chen, Yu Huang, Siyuan Li, Rui Yao, Hanqian Li, Hanyu Zhang, Jungang Li, Jian Chen, Bowen Wang, Xuming Hu
Title: KnowMT-Bench: Benchmarking Knowledge-Intensive Long-Form Question Answering in Multi-Turn Dialogues
Abstract:
Multi-Turn Long-Form Question Answering (MT-LFQA) is a key application paradigm of Large Language Models (LLMs) in knowledge-intensive domains. However, existing benchmarks are limited to single-turn dialogue, while multi-turn dialogue benchmarks typically assess other orthogonal capabilities rather than knowledge-intensive factuality. To bridge this critical gap, we introduce \textbf{KnowMT-Bench}, the \textit{first-ever} benchmark designed to systematically evaluate MT-LFQA for LLMs across knowledge-intensive fields, including medicine, finance, and law. To faithfully assess the model's real-world performance, KnowMT-Bench employs a dynamic evaluation setting where models generate their own multi-turn dialogue histories given logically progressive question sequences. The factual capability and information delivery efficiency of the \textit{final-turn} answer are then evaluated using a human-validated automated pipeline. Our experiments reveal that multi-turn contexts degrade performance: factual capability declines due to the contextual noise from self-generated histories, while information efficiency drops as models become more verbose with increasing dialogue length. We then investigate mitigation strategies, demonstrating that retrieval-augmented generation (RAG) can effectively alleviate and even reverse this factual degradation. These findings underscore the importance of our benchmark in evaluating and enhancing the conversational factual capabilities of LLMs in real-world knowledge-intensive applications. Code is available at \href{https://github.com/hardenyu21/KnowMT-Bench}{\textcolor{cyan}{\texttt{KnowMT-Bench}}}.
中文: KnowMT-Bench是首个针对知识密集型领域多轮长问答的评估基准,发现上下文噪声会导致性能下降,并证明检索增强生成能有效缓解事实性退化问题。
English: KnowMT-Bench is the first benchmark to evaluate multi-turn long-form question answering in knowledge-intensive fields, revealing performance degradation due to contextual noise and demonstrating that retrieval-augmented generation can mitigate factual decline.

Authors:Taejong Joo, Shu Ishida, Ivan Sosnovik, Bryan Lim, Sahand Rezaei-Shoshtari, Adam Gaier, Robert Giaquinto
Title: Graph of Agents: Principled Long Context Modeling by Emergent Multi-Agent Collaboration
Abstract:
As a model-agnostic approach to long context modeling, multi-agent systems can process inputs longer than a large language model's context window without retraining or architectural modifications. However, their performance often heavily relies on hand-crafted multi-agent collaboration strategies and prompt engineering, which limit generalizability. In this work, we introduce a principled framework that formalizes the model-agnostic long context modeling problem as a compression problem, yielding an information-theoretic compression objective. Building on this framework, we propose Graph of Agents (GoA), which dynamically constructs an input-dependent collaboration structure that maximizes this objective. For Llama 3.1 8B and Qwen3 8B across six document question answering benchmarks, GoA improves the average $F_1$ score of retrieval-augmented generation by 5.7\% and a strong multi-agent baseline using a fixed collaboration structure by 16.35\%, respectively. Even with only a 2K context window, GoA surpasses the 128K context window Llama 3.1 8B on LongBench, showing a dramatic increase in effective context length. Our source code is available at https://github.com/tjoo512/graph-of-agents.
中文: 本文提出Graph of Agents (GoA)框架,将模型无关的长上下文建模形式化为压缩问题,通过动态构建输入依赖的协作结构来优化信息论目标,在多个基准测试中显著超越了现有方法。
English: This paper introduces Graph of Agents (GoA), a principled framework that formalizes model-agnostic long context modeling as a compression problem and dynamically constructs input-dependent collaboration structures to maximize information-theoretic objectives, significantly outperforming existing methods across multiple benchmarks.

Authors:Shengxiang Xu, Jiayi Zhang, Shimin Di, Yuyu Luo, Liang Yao, Hanmo Liu, Jia Zhu, Fan Liu, Min-Ling Zhang
Title: RobustFlow: Towards Robust Agentic Workflow Generation
Abstract:
The automated generation of agentic workflows is a promising frontier for enabling large language models (LLMs) to solve complex tasks. However, our investigation reveals that the robustness of agentic workflow remains a critical, unaddressed challenge. Current methods often generate wildly inconsistent workflows when provided with instructions that are semantically identical but differently phrased. This brittleness severely undermines their reliability and trustworthiness for real-world applications. To quantitatively diagnose this instability, we propose metrics based on nodal and topological similarity to evaluate workflow consistency against common semantic variations such as paraphrasing and noise injection. Subsequently, we further propose a novel training framework, RobustFlow, that leverages preference optimization to teach models invariance to instruction variations. By training on sets of synonymous task descriptions, RobustFlow boosts workflow robustness scores to 70\% - 90\%, which is a substantial improvement over existing approaches. The code is publicly available at https://github.com/DEFENSE-SEU/RobustFlow.
中文: 智能体工作流生成面临语义相近指令导致输出不一致的鲁棒性挑战,而提出的RobustFlow框架通过偏好优化训练,将工作流一致性显著提升至70%-90%。
English: Agentic workflow generation for LLMs faces robustness issues with inconsistent outputs from semantically similar instructions, but the proposed RobustFlow framework significantly improves consistency to 70%-90% through preference optimization training.

Authors:Yu Shang, Yangcheng Yu, Xin Zhang, Xin Jin, Haisheng Su, Wei Wu, Yong Li
Title: MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation
Abstract:
Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel space model. This design allows MoWM to highlight the informative visual details needed for action decoding. Extensive evaluations on the CALVIN benchmark demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.
中文摘要:提出的MoWM框架融合了潜在模型的运动感知表征和像素模型的细粒度视觉特征,在CALVIN基准测试中实现了最优的任务成功率,显著提升了具身行动规划能力。
English Summary: The proposed MoWM framework combines motion-aware latent representations with fine-grained visual features from pixel models to enhance embodied action planning, achieving state-of-the-art performance on the CALVIN benchmark.

Authors:Yizhou Zhang, Ning Lv, Teng Wang, Jisheng Dang
Title: FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
Abstract:
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models (LLMs) via reinforcement learning. However, its practical deployment is impeded by an excessively slow training process, primarily attributed to the computationally intensive autoregressive generation of multiple responses per query, which makes the generation phase the primary performance bottleneck. Although speculative decoding presents a promising direction for acceleration, its direct application in GRPO achieves limited speedup under high-concurrency training conditions. To overcome this limitation, we propose a concurrency-aware speculative decoding framework that dynamically adjusts the drafting and verification strategy according to real-time concurrency levels, thereby maximizing the acceleration of the generation process. Furthermore, to address performance degradation arising from distributional drift between the evolving target model and the fixed draft model during training, we introduce an online draft learning mechanism that enables the draft model to continuously adapt using feedback signals from the target model. Experimental results across multiple mathematical reasoning datasets and models demonstrate that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency. The code is available at https://github.com/yedaotian9/GRPO_speculative.
中文: 该研究提出的并发感知推测解码框架通过在线草稿学习机制,能够根据实时并发水平动态调整策略并持续优化草稿模型,在数学推理任务中实现了2.35至2.72倍的端到端加速效果。
English: The proposed concurrency-aware speculative decoding framework with online draft learning accelerates GRPO training by dynamically adapting to real-time concurrency levels and continuously updating the draft model, achieving 2.35x-2.72x speedup across mathematical reasoning tasks.

Authors:Yu Shang, Lei Jin, Yiding Ma, Xin Zhang, Chen Gao, Wei Wu, Yong Li
Title: LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE
Abstract:
Video-based world models hold significant potential for generating high-quality embodied manipulation data. However, current video generation methods struggle to achieve stable long-horizon generation: classical diffusion-based approaches often suffer from temporal inconsistency and visual drift over multiple rollouts, while autoregressive methods tend to compromise on visual detail. To solve this, we introduce LongScape, a hybrid framework that adaptively combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions. This ensures each chunk represents a complete, coherent action, enabling the model to flexibly generate diverse dynamics. We further introduce a Context-aware Mixture-of-Experts (CMoE) framework that adaptively activates specialized experts for each chunk during generation, guaranteeing high visual quality and seamless chunk transitions. Extensive experimental results demonstrate that our method achieves stable and consistent long-horizon generation over extended rollouts. Our code is available at: https://github.com/tsinghua-fib-lab/Longscape.
中文: LongScape提出了一种结合扩散去噪与自回归生成的混合框架,通过动作引导的分块机制和情境感知专家混合模型,实现了稳定高质量的长序列具身操作视频生成。
English: LongScape introduces a hybrid framework combining diffusion denoising and autoregressive generation with action-guided chunking and a Context-aware Mixture-of-Experts to achieve stable, high-quality long-horizon video generation for embodied manipulation.

Authors:Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan
Title: Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Abstract:
Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure term, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code will be available at: https://github.com/YU-deep/ViF.git.
中文: 多智能体视觉语言模型因视觉注意力减弱而产生幻觉雪球效应,而提出的ViF方法通过视觉流和注意力重分配有效缓解此问题,显著提升了多个基准测试的性能。
English: Multi-agent systems using visual language models are prone to hallucination snowballing due to reduced visual attention, which is effectively mitigated by the proposed ViF method that uses visual flow and attention reallocation to enhance performance across multiple benchmarks.

Authors:Lihao Zheng, Jiawei Chen, Xintian Shen, Hao Ma, Tao Wei
Title: MIRG-RL: Multi-Image Reasoning and Grounding with Reinforcement Learning
Abstract:
Multi-image reasoning and grounding require understanding complex cross-image relationships at both object levels and image levels. Current Large Visual Language Models (LVLMs) face two critical challenges: the lack of cross-image reasoning capabilities and insufficient cross-image reference reward modeling. To address these issues, we propose a unified framework - Multi-Image Reasoning and Grounding with Reinforcement Learning (MIRG-RL). Specifically, our two-stage training paradigm combines supervised fine-tuning with annotated trajectories and image-aware reinforcement learning optimization, progressively developing multi-image reasoning capabilities. Furthermore, we innovatively propose a method for constructing the trajectory data, which integrates object-level and image-level annotation information, and use this method to generate a lightweight reasoning-enhanced dataset. To effectively resolve cross-image ambiguities, we design an image-aware RL policy with dual reward functions for objects and images. Experiments demonstrate that MIRG-RL achieves state-of-the-art (SOTA) performance in multi-image grounding benchmarks, attaining 64.82% on cross-image reasoning tasks - exceeding the previous best method by 1%. The code and dataset have been released at https://github.com/ZEUS2035/MIRG-RL.
中文: 为解决大型视觉语言模型缺乏跨图像推理能力的问题,我们提出MIRG-RL统一框架,结合监督微调与图像感知强化学习,在多图像定位基准测试中实现了最先进的性能。
English: To address the lack of cross-image reasoning capabilities in Large Visual Language Models, we propose MIRG-RL, a unified framework that combines supervised fine-tuning with image-aware reinforcement learning, achieving state-of-the-art performance in multi-image grounding benchmarks.

Authors:Tianci Wu, Guangming Zhu, Jiang Lu, Siyuan Wang, Ning Wang, Nuoye Xiong, Zhang Liang
Title: Prompt-guided Representation Disentanglement for Action Recognition
Abstract:
Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found in https://github.com/iamsnaping/ProDA.git
Chinese: ProDA框架通过时空场景图和动态提示模块,从复杂多动作场景中分离指定动作,利用图解析神经网络生成动作特定表征,在视频动作识别中展现出优于现有方法的性能。
English: The ProDA framework introduces a novel approach to action recognition by disentangling specified actions from complex multi-action scenes using spatio-temporal scene graphs and a dynamic prompt module, achieving state-of-the-art performance through action-specific representations.

Authors:Junliang Liu, Jingyu Xiao, Wenxin Tang, Wenxuan Wang, Zhixian Wang, Minrui Zhang, Shuanghe Yu
Title: Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety
Abstract:
Multimodal large language models (MLLMs) are increasingly positioned as AI collaborators for building complex web-related applications like GUI agents and front-end code generation. However, existing benchmarks largely emphasize visual perception or UI code generation, showing insufficient evaluation on the reasoning, robustness and safety capability required for end-to-end web applications. To bridge the gap, we introduce a comprehensive web understanding benchmark, named WebRSSBench, that jointly evaluates Reasoning, Robustness, and Safety across eight tasks, such as position relationship reasoning, color robustness, and safety critical detection, etc. The benchmark is constructed from 729 websites and contains 3799 question answer pairs that probe multi-step inference over page structure, text, widgets, and safety-critical interactions. To ensure reliable measurement, we adopt standardized prompts, deterministic evaluation scripts, and multi-stage quality control combining automatic checks with targeted human verification. We evaluate 12 MLLMs on WebRSSBench. The results reveal significant gaps, models still struggle with compositional and cross-element reasoning over realistic layouts, show limited robustness when facing perturbations in user interfaces and content such as layout rearrangements or visual style shifts, and are rather conservative in recognizing and avoiding safety critical or irreversible actions. Our code is available at https://github.com/jinliang-byte/webssrbench.
中文总结:WebRSSBench基准测试填补了多模态大语言模型在网页应用推理、鲁棒性和安全性评估方面的空白,通过八项任务的综合测试揭示了现有模型在复杂网页理解能力上的显著不足。
English Summary: The WebRSSBench benchmark addresses the gap in evaluating multimodal large language models' reasoning, robustness, and safety for web applications, revealing significant shortcomings in current models' capabilities through comprehensive testing across eight tasks.

Authors:Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen
Title: UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
Abstract:
Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce \textbf{UltraHorizon} a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tools management, and interaction with environments. Under the heaviest scale setting, trajectories average \textbf{200k+} tokens and \textbf{400+} tool calls, whereas in standard configurations they still exceed \textbf{35k} tokens and involve more than \textbf{60} tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. \href{https://github.com/StarDewXXX/UltraHorizon}{Our code will be available here.}
中文: UltraHorizon基准测试旨在评估自主智能体在需要持续推理和工具使用的长周期、部分可观测任务中的表现,揭示了尽管经过大规模扩展,AI智能体与人类之间仍存在显著性能差距。
English: The UltraHorizon benchmark is introduced to evaluate autonomous agents in long-horizon, partially observable tasks requiring sustained reasoning and tool use, revealing significant performance gaps between AI agents and humans despite extensive scaling.

Authors:Lan Chen, Yuchao Gu, Qi Mao
Title: UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Abstract:
Large language models, trained on extensive corpora, successfully unify diverse linguistic tasks within a single generative framework. Inspired by this, recent works like Large Vision Model (LVM) extend this paradigm to vision by organizing tasks into sequential visual sentences, where visual prompts serve as the context to guide outputs. However, such modeling requires task-specific pre-training across modalities and sources, which is costly and limits scalability to unseen tasks. Given that pre-trained video generation models inherently capture temporal sequence dependencies, we explore a more unified and scalable alternative: can a pre-trained video generation model adapt to diverse image and video tasks? To answer this, we propose UniVid, a framework that fine-tunes a video diffusion transformer to handle various vision tasks without task-specific modifications. Tasks are represented as visual sentences, where the context sequence defines both the task and the expected output modality. We evaluate the generalization of UniVid from two perspectives: (1) cross-modal inference with contexts composed of both images and videos, extending beyond LVM's uni-modal setting; (2) cross-source tasks from natural to annotated data, without multi-source pre-training. Despite being trained solely on natural video data, UniVid generalizes well in both settings. Notably, understanding and generation tasks can easily switch by simply reversing the visual sentence order in this paradigm. These findings highlight the potential of pre-trained video generation models to serve as a scalable and unified foundation for vision modeling. Our code will be released at https://github.com/CUC-MIPG/UniVid.
中文摘要:UniVid框架通过视觉句子表示,将预训练视频生成模型应用于多种视觉任务,在跨模态和跨数据源的场景中展现出强大泛化能力,无需针对特定任务进行训练。
English Summary: The UniVid framework adapts a pre-trained video generation model to handle diverse vision tasks through visual sentence representations, demonstrating strong generalization across modalities and data sources without task-specific training.

Authors:Mehwish Mehmood, Ivor Spence, Muhammad Fahim
Title: LFA-Net: A Lightweight Network with LiteFusion Attention for Retinal Vessel Segmentation
Abstract:
Lightweight retinal vessel segmentation is important for the early diagnosis of vision-threatening and systemic diseases, especially in a real-world clinical environment with limited computational resources. Although segmentation methods based on deep learning are improving, existing models are still facing challenges of small vessel segmentation and high computational costs. To address these challenges, we proposed a new vascular segmentation network, LFA-Net, which incorporates a newly designed attention module, LiteFusion-Attention. This attention module incorporates residual learning connections, Vision Mamba-inspired dynamics, and modulation-based attention, enabling the model to capture local and global context efficiently and in a lightweight manner. LFA-Net offers high performance with 0.11 million parameters, 0.42 MB memory size, and 4.46 GFLOPs, which make it ideal for resource-constrained environments. We validated our proposed model on DRIVE, STARE, and CHASE_DB with outstanding performance in terms of dice scores of 83.28, 87.44, and 84.50% and Jaccard indices of 72.85, 79.31, and 74.70%, respectively. The code of LFA-Net is available online https://github.com/Mehwish4593/LFA-Net.
Chinese: 研究人员开发了LFA-Net轻量化视网膜血管分割网络,采用创新的LiteFusion-Attention模块,在低计算资源下实现高性能,非常适合临床诊断应用。
English: Researchers developed LFA-Net, a lightweight retinal vessel segmentation network featuring the innovative LiteFusion-Attention module, which achieves high performance with minimal computational resources, making it ideal for clinical diagnostics.

Authors:Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Christopher Ré, Scott W. Linderman
Title: A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems
Abstract:
Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
中文: 本文提出了一个基于线性动力系统的统一框架,将多种并行化顺序模型的定点方法联系起来,为它们的有效性和可扩展计算潜力提供了理论依据。
English: This paper presents a unified framework based on linear dynamical systems that connects various fixed-point methods for parallelizing sequential models, offering theoretical insights into their effectiveness and potential for scalable computation.

Authors:Weikai Lin, Sushant Kondguli, Carl Marshall, Yuhao Zhu
Title: PowerGS: Display-Rendering Power Co-Optimization for Neural Rendering in Power-Constrained XR Systems
Abstract:
3D Gaussian Splatting (3DGS) combines classic image-based rendering, pointbased graphics, and modern differentiable techniques, and offers an interesting alternative to traditional physically-based rendering. 3DGS-family models are far from efficient for power-constrained Extended Reality (XR) devices, which need to operate at a Watt-level. This paper introduces PowerGS, the first framework to jointly minimize the rendering and display power in 3DGS under a quality constraint. We present a general problem formulation and show that solving the problem amounts to 1) identifying the iso-quality curve(s) in the landscape subtended by the display and rendering power and 2) identifying the power-minimal point on a given curve, which has a closed-form solution given a proper parameterization of the curves. PowerGS also readily supports foveated rendering for further power savings. Extensive experiments and user studies show that PowerGS achieves up to 86% total power reduction compared to state-of-the-art 3DGS models, with minimal loss in both subjective and objective quality. Code is available at https://github.com/horizon-research/PowerGS.
中文:PowerGS是首个针对功耗受限设备联合优化3D高斯泼溅渲染与显示功耗的框架,通过等质量曲线分析和注视点渲染技术,在保证画质的同时最高可降低86%总功耗。
English: PowerGS is a pioneering framework that jointly optimizes rendering and display power for 3D Gaussian Splatting models on power-constrained devices, achieving up to 86% power reduction while maintaining quality through iso-quality curve analysis and foveated rendering.

Authors:Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Ayan Biswas, Diane Oyen, Earl Lawrence
Title: MORPH: Shape-agnostic PDE Foundation Models
Abstract:
We introduce MORPH, a shape-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data dimensionality (1D--3D) at different resolutions, multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters (LoRA), MORPH outperforms models trained from scratch in both zero-shot and full-shot generalization. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.
Chinese: MORPH是一种与形状无关的自回归PDE基础模型,能处理1D-3D异构时空数据集,通过创新的架构组件和高效训练技术,在泛化任务中超越现有模型表现。
English: MORPH is a shape-agnostic, autoregressive foundation model for PDEs that handles heterogeneous spatiotemporal datasets across 1D-3D dimensions and outperforms existing models in generalization tasks through innovative architectural components and efficient training techniques.

Authors:Mingze Dong, Leda Wang, Yuval Kluger
Title: Understanding and Enhancing Mask-Based Pretraining towards Universal Representations
Abstract:
Mask-based pretraining has become a cornerstone of modern large-scale models across language, vision, and recently biology. Despite its empirical success, its role and limits in learning data representations have been unclear. In this work, we show that the behavior of mask-based pretraining can be directly characterized by test risk in high-dimensional minimum-norm ("ridge-less") linear regression, without relying on further model specifications. Further analysis of linear models uncovers several novel aspects of mask-based pretraining. The theoretical framework and its implications have been validated across diverse neural architectures (including MLPs, CNNs, and Transformers) applied to both vision and language tasks. Guided by our theory, we propose an embarrassingly simple yet overlooked pretraining scheme named Randomly Random Mask AutoEncoding (R$^2$MAE), which enforces capturing multi-scale features from data and is able to outperform optimal fixed mask ratio settings in our linear model framework. We implement R$^2$MAE in vision, language, DNA sequence, and single-cell models, where it consistently outperforms standard and more complicated masking schemes, leading to improvements for state-of-the-art models. Our code is available at: https://github.com/MingzeDong/r2mae
中文摘要:基于掩码的预训练通过高维线性回归得到理论解析,由此提出的R²MAE多尺度掩码方法以简驭繁,在多个领域超越现有方案。
English Summary: Mask-based pretraining is theoretically analyzed through high-dimensional linear regression, leading to the development of R²MAE, a simple yet effective multi-scale masking method that outperforms existing approaches across various domains.

Authors:Andreas Burger, Luca Thiede, Nikolaj Rønne, Varinia Bernales, Nandita Vijaykumar, Tejs Vegge, Arghya Bhowmik, Alan Aspuru-Guzik
Title: Shoot from the HIP: Hessian Interatomic Potentials without derivatives
Abstract:
Fundamental tasks in computational chemistry, from transition state search to vibrational analysis, rely on molecular Hessians, which are the second derivatives of the potential energy. Yet, Hessians are computationally expensive to calculate and scale poorly with system size, with both quantum mechanical methods and neural networks. In this work, we demonstrate that Hessians can be predicted directly from a deep learning model, without relying on automatic differentiation or finite differences. We observe that one can construct SE(3)-equivariant, symmetric Hessians from irreducible representations (irrep) features up to degree $l$=2 computed during message passing in graph neural networks. This makes HIP Hessians one to two orders of magnitude faster, more accurate, more memory efficient, easier to train, and enables more favorable scaling with system size. We validate our predictions across a wide range of downstream tasks, demonstrating consistently superior performance for transition state search, accelerated geometry optimization, zero-point energy corrections, and vibrational analysis benchmarks. We open-source the HIP codebase and model weights to enable further development of the direct prediction of Hessians at https://github.com/BurgerAndreas/hip
中文: 本研究提出一种深度学习模型,通过SE(3)等变图神经网络直接预测分子Hessian矩阵,在计算化学任务中实现了速度、精度和可扩展性的显著提升。
English: This research introduces a deep learning model that directly predicts molecular Hessians using SE(3)-equivariant graph neural networks, achieving significant improvements in speed, accuracy, and scalability for computational chemistry tasks.

Authors:Yi Zhu, Heitor R. Guimarães, Arthur Pimentel, Tiago Falk
Title: AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
Abstract:
With the prevalence of artificial intelligence (AI)-generated content, such as audio deepfakes, a large body of recent work has focused on developing deepfake detection techniques. However, most models are evaluated on a narrow set of datasets, leaving their generalization to real-world conditions uncertain. In this paper, we systematically review 28 existing audio deepfake datasets and present an open-source benchmarking toolkit called AUDDT (https://github.com/MuSAELab/AUDDT). The goal of this toolkit is to automate the evaluation of pretrained detectors across these 28 datasets, giving users direct feedback on the advantages and shortcomings of their deepfake detectors. We start by showcasing the usage of the developed toolkit, the composition of our benchmark, and the breakdown of different deepfake subgroups. Next, using a widely adopted pretrained deepfake detector, we present in- and out-of-domain detection results, revealing notable differences across conditions and audio manipulation types. Lastly, we also analyze the limitations of these existing datasets and their gap relative to practical deployment scenarios.
中文: 本文介绍了AUDDT开源工具包,用于在28个数据集上对音频深度伪造检测器进行基准测试,揭示了其性能差异及实际应用中的泛化局限性。
English: This paper introduces AUDDT, an open-source toolkit for benchmarking audio deepfake detectors across 28 datasets, revealing performance variations and limitations in real-world generalization.

Authors:Prasanna Reddy Pulakurthi, Jiamian Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Zhiqiang Tao
Title: X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning
Abstract:
Prevalent text-to-video retrieval systems mainly adopt embedding models for feature extraction and compute cosine similarities for ranking. However, this design presents two limitations. Low-quality text-video data pairs could compromise the retrieval, yet are hard to identify and examine. Cosine similarity alone provides no explanation for the ranking results, limiting the interpretability. We ask that can we interpret the ranking results, so as to assess the retrieval models and examine the text-video data? This work proposes X-CoT, an explainable retrieval framework upon LLM CoT reasoning in place of the embedding model-based similarity ranking. We first expand the existing benchmarks with additional video annotations to support semantic understanding and reduce data bias. We also devise a retrieval CoT consisting of pairwise comparison steps, yielding detailed reasoning and complete ranking. X-CoT empirically improves the retrieval performance and produces detailed rationales. It also facilitates the model behavior and data quality analysis. Code and data are available at: https://github.com/PrasannaPulakurthi/X-CoT.
Chinese: 现有文本-视频检索系统存在数据质量低和结果不可解释的问题,X-CoT通过采用基于大语言模型的推理框架替代传统嵌入模型,不仅提升了检索性能,还能生成详细推理过程。
English: Current text-to-video retrieval systems face limitations from low-quality data and lack of interpretability, which X-CoT addresses by replacing embedding models with an LLM-based reasoning framework that enhances performance and provides detailed explanations.

Authors:Rohan Sanda, Asad Aali, Andrew Johnston, Eduardo Reis, Jonathan Singh, Gordon Wetzstein, Sara Fridovich-Keil
Title: Patch-Based Diffusion for Data-Efficient, Radiologist-Preferred MRI Reconstruction
Abstract:
Magnetic resonance imaging (MRI) requires long acquisition times, raising costs, reducing accessibility, and making scans more susceptible to motion artifacts. Diffusion probabilistic models that learn data-driven priors can potentially assist in reducing acquisition time. However, they typically require large training datasets that can be prohibitively expensive to collect. Patch-based diffusion models have shown promise in learning effective data-driven priors over small real-valued datasets, but have not yet demonstrated clinical value in MRI. We extend the Patch-based Diffusion Inverse Solver (PaDIS) to complex-valued, multi-coil MRI reconstruction, and compare it against a state-of-the-art whole-image diffusion baseline (FastMRI-EDM) for 7x undersampled MRI reconstruction on the FastMRI brain dataset. We show that PaDIS-MRI models trained on small datasets of as few as 25 k-space images outperform FastMRI-EDM on image quality metrics (PSNR, SSIM, NRMSE), pixel-level uncertainty, cross-contrast generalization, and robustness to severe k-space undersampling. In a blinded study with three radiologists, PaDIS-MRI reconstructions were chosen as diagnostically superior in 91.7% of cases, compared to baselines (i) FastMRI-EDM and (ii) classical convex reconstruction with wavelet sparsity. These findings highlight the potential of patch-based diffusion priors for high-fidelity MRI reconstruction in data-scarce clinical settings where diagnostic confidence matters.
中文摘要:基于补丁的扩散模型PaDIS-MRI仅需少量训练数据即可实现高质量加速MRI重建,在诊断准确性和鲁棒性上均优于传统方法。
English Summary: Patch-based diffusion models like PaDIS-MRI enable high-quality, accelerated MRI reconstruction with small training datasets, outperforming conventional methods in diagnostic accuracy and robustness.

Authors:Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, Lifeng Jin
Title: Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Abstract:
Reinforcement fine-tuning (RFT) often suffers from \emph{reward over-optimization}, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs. Our theoretical analysis shows that the key lies in reward misspecification at the high-reward tail: the inability to reliably distinguish Excellent responses from merely Great ones. This motivate us to focus on the high-reward region. However, such tail examples are scarce under the base LLM. While off-policy exemplars (e.g. from stronger models or rewrites) are easier to obtain, naively training on them yields a misspecified reward for the policy we aim to align. To address this, we study rubric-based rewards. By design, rubrics can leverage off-policy examples while remaining insensitive to their artifacts. To elicit rubrics that capture the high-reward tail, we highlight the importance of distinguishing among great and diverse responses, and introduce a workflow to implement this idea. We empirically demonstrate that rubric-based rewards substantially mitigate reward over-optimization and deliver effective LLM post-training improvements. Our code can be accessed at https://github.com/Jun-Kai-Zhang/rubrics.git .
中文: 强化微调常面临奖励过优化问题,模型会利用奖励信号获得高分却输出低质量内容,但基于规则的奖励设计能有效缓解此问题,通过利用非策略示例并避免其伪影,从而提升模型对齐效果。
English: Reinforcement fine-tuning often faces reward over-optimization, where models exploit reward signals to score high despite poor outputs, but using rubric-based rewards effectively mitigates this issue and enhances model alignment by leveraging off-policy examples without succumbing to their artifacts.

Authors:Yuan Gao, Hao Wu, Qingsong Wen, Kun Wang, Xian Wu, Xiaomeng Huang
Title: VISION: Prompting Ocean Vertical Velocity Reconstruction from Incomplete Observations
Abstract:
Reconstructing subsurface ocean dynamics, such as vertical velocity fields, from incomplete surface observations poses a critical challenge in Earth science, a field long hampered by the lack of standardized, analysis-ready benchmarks. To systematically address this issue and catalyze research, we first build and release KD48, a high-resolution ocean dynamics benchmark derived from petascale simulations and curated with expert-driven denoising. Building on this benchmark, we introduce VISION, a novel reconstruction paradigm based on Dynamic Prompting designed to tackle the core problem of missing data in real-world observations. The essence of VISION lies in its ability to generate a visual prompt on-the-fly from any available subset of observations, which encodes both data availability and the ocean's physical state. More importantly, we design a State-conditioned Prompting module that efficiently injects this prompt into a universal backbone, endowed with geometry- and scale-aware operators, to guide its adaptive adjustment of computational strategies. This mechanism enables VISION to precisely handle the challenges posed by varying input combinations. Extensive experiments on the KD48 benchmark demonstrate that VISION not only substantially outperforms state-of-the-art models but also exhibits strong generalization under extreme data missing scenarios. By providing a high-quality benchmark and a robust model, our work establishes a solid infrastructure for ocean science research under data uncertainty. Our codes are available at: https://github.com/YuanGao-YG/VISION.
中文: 本研究提出了高分辨率海洋动力学基准KD48和新型重建模型VISION,该模型通过动态视觉提示机制自适应处理不完整观测数据,在极端数据缺失场景下显著优于现有方法,为海洋科学研究建立了坚实基础。
English: This study introduces KD48, a high-resolution ocean dynamics benchmark, and VISION, a novel reconstruction model that dynamically adapts to incomplete data through visual prompting, significantly outperforming existing methods and enhancing research infrastructure for subsurface ocean analysis.

Authors:Hude Liu, Jerry Yao-Chieh Hu, Jennifer Yuntong Zhang, Zhao Song, Han Liu
Title: Are Hallucinations Bad Estimations?
Abstract:
We formalize hallucinations in generative models as failures to link an estimate to any plausible cause. Under this interpretation, we show that even loss-minimizing optimal estimators still hallucinate. We confirm this with a general high probability lower bound on hallucinate rate for generic data distributions. This reframes hallucination as structural misalignment between loss minimization and human-acceptable outputs, and hence estimation errors induced by miscalibration. Experiments on coin aggregation, open-ended QA, and text-to-image support our theory.
Chinese: 该研究将生成模型中的幻觉重新定义为损失最小化与人类期望之间的结构性错配,证明即使是最优估计器也会因校准误差而产生幻觉。
English: The study redefines hallucinations in generative models as structural misalignment between loss minimization and human expectations, demonstrating that even optimal estimators hallucinate due to estimation errors from miscalibration.

Authors:George Yakushev, Alina Shutova, Ivan Rubachev, Renat Sergazinov, Artem Babenko
Title: Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data
Abstract:
Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining provides the exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly to inference. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in agentic setup. We design a minimal set of tools for constructing, analyzing and manipulating decision trees. By using these tools, LLMs combine their prior knowledge with learning from data to create a lightweight decision tree that outperforms traditional CART on low-resource tabular problems. While a single decision tree does not outperform state-of-the-art black box models, it comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based LLM's creation process allows for additional human input: correcting biases or incorporating domain-specific intuition that is not captured in the data.
中文: 本研究提出利用具备推理能力的大语言模型为小型表格数据集生成可解释的决策树,该方法在超越传统CART模型性能的同时,提供透明推理路径并支持人工介入修正偏差。
English: This study proposes using reasoning-capable LLMs to generate interpretable decision trees for small tabular datasets, which outperform traditional CART methods while providing transparent reasoning traces and allowing human intervention to correct biases.

Authors:Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova, Danila Rukhovich, Maksim Kolodiazhnyi
Title: TUN3D: Towards Real-World Scene Understanding from Unposed Images
Abstract:
Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d .
中文摘要:TUN3D是首个仅通过多视角图像即可联合实现布局估计与3D物体检测的方法,无需深度传感器或相机位姿真值监督,在多项基准测试中达到了最先进的性能水平。
English Summary: TUN3D is the first method to jointly perform layout estimation and 3D object detection from multi-view images without requiring depth sensors or camera pose supervision, achieving state-of-the-art performance across multiple benchmarks.

Authors:Anja Sheppard, Tyler Smithline, Andrew Scheffer, David Smith, Advaith V. Sethuraman, Ryan Bird, Sabrina Lin, Katherine A. Skinner
Title: ShipwreckFinder: A QGIS Tool for Shipwreck Detection in Multibeam Sonar Data
Abstract:
In this paper, we introduce ShipwreckFinder, an open-source QGIS plugin that detects shipwrecks from multibeam sonar data. Shipwrecks are an important historical marker of maritime history, and can be discovered through manual inspection of bathymetric data. However, this is a time-consuming process and often requires expert analysis. Our proposed tool allows users to automatically preprocess bathymetry data, perform deep learning inference, threshold model outputs, and produce either pixel-wise segmentation masks or bounding boxes of predicted shipwrecks. The backbone of this open-source tool is a deep learning model, which is trained on a variety of shipwreck data from the Great Lakes and the coasts of Ireland. Additionally, we employ synthetic data generation in order to increase the size and diversity of our dataset. We demonstrate superior segmentation performance with our open-source tool and training pipeline as compared to a deep learning-based ArcGIS toolkit and a more classical inverse sinkhole detection method. The open-source tool can be found at https://github.com/umfieldrobotics/ShipwreckFinderQGISPlugin.
Chinese: ShipwreckFinder 是一款开源 QGIS 插件,通过深度学习从多波束声纳数据中自动检测沉船,在分割精度上优于现有方法。
English: ShipwreckFinder is an open-source QGIS plugin that uses deep learning to automatically detect shipwrecks from multibeam sonar data, outperforming existing methods in segmentation accuracy.

Authors:Yinfeng Yu, Hailong Zhang, Meiling Zhu
Title: Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
Abstract:
Audiovisual embodied navigation enables robots to locate audio sources by dynamically integrating visual observations from onboard sensors with the auditory signals emitted by the target. The core challenge lies in effectively leveraging multimodal cues to guide navigation. While prior works have explored basic fusion of visual and audio data, they often overlook deeper perceptual context. To address this, we propose the Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation (DMTF-AVN). Our approach uses a multi-target architecture coupled with a refined Transformer mechanism to filter and selectively fuse cross-modal information. Extensive experiments on the Replica and Matterport3D datasets demonstrate that DMTF-AVN achieves state-of-the-art performance, outperforming existing methods in success rate (SR), path efficiency (SPL), and scene adaptation (SNA). Furthermore, the model exhibits strong scalability and generalizability, paving the way for advanced multimodal fusion strategies in robotic navigation. The code and videos are available at https://github.com/zzzmmm-svg/DMTF.
Chinese: 提出的DMTF-AVN模型通过多目标Transformer架构动态融合视觉与听觉线索,在多个基准数据集上实现了导航精度与适应性的最先进性能。
English: The proposed DMTF-AVN model advances audiovisual navigation by dynamically fusing visual and auditory cues through a multi-target Transformer architecture, achieving state-of-the-art performance in accuracy and adaptability across benchmark datasets.

Authors:Dayu Yang, Hui Fang
Title: ReGeS: Reciprocal Retrieval-Generation Synergy for Conversational Recommender Systems
Abstract:
Connecting conversation with external domain knowledge is vital for conversational recommender systems (CRS) to correctly understand user preferences. However, existing solutions either require domain-specific engineering, which limits flexibility, or rely solely on large language models, which increases the risk of hallucination. While Retrieval-Augmented Generation (RAG) holds promise, its naive use in CRS is hindered by noisy dialogues that weaken retrieval and by overlooked nuances among similar items. We propose ReGeS, a reciprocal Retrieval-Generation Synergy framework that unifies generation-augmented retrieval to distill informative user intent from conversations and retrieval-augmented generation to differentiate subtle item features. This synergy obviates the need for extra annotations, reduces hallucinations, and simplifies continuous updates. Experiments on multiple CRS benchmarks show that ReGeS achieves state-of-the-art performance in recommendation accuracy, demonstrating the effectiveness of reciprocal synergy for knowledge-intensive CRS tasks.
Chinese: ReGeS框架通过检索与生成的协同作用,从对话中提炼用户意图并区分细微物品特征,无需额外标注且减少幻觉,在多个基准测试中实现了最先进的推荐准确性。
English: The ReGeS framework introduces a reciprocal synergy between retrieval and generation to enhance conversational recommender systems by distilling user intent and differentiating item features, achieving state-of-the-art accuracy without extra annotations or hallucinations.

Authors:Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang
Title: Influence Guided Context Selection for Effective Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) addresses large language model (LLM) hallucinations by grounding responses in external knowledge, but its effectiveness is compromised by poor-quality retrieved contexts containing irrelevant or noisy information. While existing approaches attempt to improve performance through context selection based on predefined context quality assessment metrics, they show limited gains over standard RAG. We attribute this limitation to their failure in holistically utilizing available information (query, context list, and generator) for comprehensive quality assessment. Inspired by recent advances in data selection, we reconceptualize context quality assessment as an inference-time data valuation problem and introduce the Contextual Influence Value (CI value). This novel metric quantifies context quality by measuring the performance degradation when removing each context from the list, effectively integrating query-aware relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI value eliminates complex selection hyperparameter tuning by simply retaining contexts with positive CI values. To address practical challenges of label dependency and computational overhead, we develop a parameterized surrogate model for CI value prediction during inference. The model employs a hierarchical architecture that captures both local query-context relevance and global inter-context interactions, trained through oracle CI value supervision and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and multiple LLMs demonstrate that our context selection method significantly outperforms state-of-the-art baselines, effectively filtering poor-quality contexts while preserving critical information. Code is available at https://github.com/SJTU-DMTai/RAG-CSM.
中文: 检索增强生成(RAG)通过引入外部知识减少大语言模型的幻觉,但其效果常受低质量检索上下文的制约。为此,研究者提出上下文影响力值(CI值)这一新指标,通过量化移除上下文导致的性能下降来综合评估质量,无需复杂参数调整即可有效过滤劣质上下文,在多种自然语言处理任务中显著优于现有方法。
English: Retrieval-Augmented Generation (RAG) mitigates LLM hallucinations by incorporating external knowledge, yet its efficacy is hindered by low-quality retrieved contexts. To address this, the authors introduce the Contextual Influence Value (CI value), a novel metric that holistically assesses context quality by measuring performance degradation upon removal, enabling effective filtering without complex parameter tuning and significantly outperforming existing methods across diverse NLP tasks.

Authors:Huizhe Zhang, Jintang Li, Yuchang Zhu, Liang Chen, Li Kuang
Title: SGNNBench: A Holistic Evaluation of Spiking Graph Neural Network on Large-scale Graph
Abstract:
Graph Neural Networks (GNNs) are exemplary deep models designed for graph data. Message passing mechanism enables GNNs to effectively capture graph topology and push the performance boundaries across various graph tasks. However, the trend of developing such complex machinery for graph representation learning has become unsustainable on large-scale graphs. The computational and time overhead make it imperative to develop more energy-efficient GNNs to cope with the explosive growth of real-world graphs. Spiking Graph Neural Networks (SGNNs), which integrate biologically plausible learning via unique spike-based neurons, have emerged as a promising energy-efficient alternative. Different layers communicate with sparse and binary spikes, which facilitates computation and storage of intermediate graph representations. Despite the proliferation of SGNNs proposed in recent years, there is no systematic benchmark to explore the basic design principles of these brain-inspired networks on the graph data. To bridge this gap, we present SGNNBench to quantify progress in the field of SGNNs. Specifically, SGNNBench conducts an in-depth investigation of SGNNs from multiple perspectives, including effectiveness, energy efficiency, and architectural design. We comprehensively evaluate 9 state-of-the-art SGNNs across 18 datasets. Regarding efficiency, we empirically compare these baselines w.r.t model size, memory usage, and theoretical energy consumption to reveal the often-overlooked energy bottlenecks of SGNNs. Besides, we elaborately investigate the design space of SGNNs to promote the development of a general SGNN paradigm.
中文: 图神经网络在大规模图数据上计算开销不可持续,因此出现了利用脉冲进行高效计算的节能型脉冲图神经网络,但缺乏系统基准促使SGNNBench的建立,从性能、能效和架构设计多角度进行全面评估。
English: Graph Neural Networks face unsustainable computational demands on large-scale graphs, prompting the emergence of energy-efficient Spiking Graph Neural Networks (SGNNs) that use binary spikes for efficient processing, though a lack of systematic benchmarking led to the creation of SGNNBench for comprehensive evaluation across effectiveness, efficiency, and design.

Authors:Jiahao Zhang, Wenzhe Yin, Shujian Yu
Title: Cross-Modal Retrieval with Cauchy-Schwarz Divergence
Abstract:
Effective cross-modal retrieval requires robust alignment of heterogeneous data types. Most existing methods focus on bi-modal retrieval tasks and rely on distributional alignment techniques such as Kullback-Leibler divergence, Maximum Mean Discrepancy, and correlation alignment. However, these methods often suffer from critical limitations, including numerical instability, sensitivity to hyperparameters, and their inability to capture the full structure of the underlying distributions. In this paper, we introduce the Cauchy-Schwarz (CS) divergence, a hyperparameter-free measure that improves both training stability and retrieval performance. We further propose a novel Generalized CS (GCS) divergence inspired by Hölder's inequality. This extension enables direct alignment of three or more modalities within a unified mathematical framework through a bidirectional circular comparison scheme, eliminating the need for exhaustive pairwise comparisons. Extensive experiments on six benchmark datasets demonstrate the effectiveness of our method in both bi-modal and tri-modal retrieval tasks. The code of our CS/GCS divergence is publicly available at https://github.com/JiahaoZhang666/CSD.
中文: 本文提出柯西-施瓦茨散度及其广义形式,通过无超参数且稳定的方法在统一框架中对齐多模态,解决了跨模态检索中的关键局限。
English: This paper introduces the Cauchy-Schwarz divergence and its generalized version to address limitations in cross-modal retrieval by providing a hyperparameter-free, stable method that effectively aligns multiple modalities within a unified framework.

Authors:Guohang Yan, Yue Zhang, Pinlong Cai, Ding Wang, Song Mao, Hongwei Zhang, Yaoze Zhang, Hairong Zhang, Xinyu Cai, Botian Shi
Title: HetaRAG: Hybrid Deep Retrieval-Augmented Generation across Heterogeneous Data Stores
Abstract:
Retrieval-augmented generation (RAG) has become a dominant paradigm for mitigating knowledge hallucination and staleness in large language models (LLMs) while preserving data security. By retrieving relevant evidence from private, domain-specific corpora and injecting it into carefully engineered prompts, RAG delivers trustworthy responses without the prohibitive cost of fine-tuning. Traditional retrieval-augmented generation (RAG) systems are text-only and often rely on a single storage backend, most commonly a vector database. In practice, this monolithic design suffers from unavoidable trade-offs: vector search captures semantic similarity yet loses global context; knowledge graphs excel at relational precision but struggle with recall; full-text indexes are fast and exact yet semantically blind; and relational engines such as MySQL provide strong transactional guarantees but no semantic understanding. We argue that these heterogeneous retrieval paradigms are complementary, and propose a principled fusion scheme to orchestrate them synergistically, mitigating the weaknesses of any single modality. In this work we introduce HetaRAG, a hybrid, deep-retrieval augmented generation framework that orchestrates cross-modal evidence from heterogeneous data stores. We plan to design a system that unifies vector indices, knowledge graphs, full-text engines, and structured databases into a single retrieval plane, dynamically routing and fusing evidence to maximize recall, precision, and contextual fidelity. To achieve this design goal, we carried out preliminary explorations and constructed an initial RAG pipeline; this technical report provides a brief overview. The partial code is available at https://github.com/KnowledgeXLab/HetaRAG.
中文: 检索增强生成(RAG)通过整合多源数据证据提升大语言模型的可靠性,HetaRAG提出混合框架协同融合向量索引、知识图谱及其他数据库,以提高精确率和召回率。
English: Retrieval-augmented generation (RAG) enhances LLM reliability by integrating evidence from multiple data sources, and HetaRAG proposes a hybrid framework to synergistically combine vector indices, knowledge graphs, and other databases for improved precision and recall.

Authors:Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai
Title: SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Abstract:
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
我们提出了一个科学推理基础模型,它将自然语言与多种科学数据格式对齐,通过大规模语料预训练和精细调优技术,实现了跨科学工作流的准确多任务处理能力,并在覆盖范围、泛化性和保真度方面超越了专业系统。
This scientific reasoning foundation model integrates natural language with diverse scientific data formats, trained on a massive corpus and refined through advanced techniques to enable accurate, multi-task capabilities across various scientific workflows while outperforming specialized systems in coverage, generalization, and fidelity.

Authors:Yu Yuan, Xijun Wang, Tharindu Wickremasinghe, Zeeshan Nadir, Bole Ma, Stanley H. Chan
Title: NewtonGen: Physics-Consistent and Controllable Text-to-Video Generation via Neural Newtonian Dynamics
Abstract:
A primary bottleneck in large-scale text-to-video generation today is physical consistency and controllability. Despite recent advances, state-of-the-art models often produce unrealistic motions, such as objects falling upward, or abrupt changes in velocity and direction. Moreover, these models lack precise parameter control, struggling to generate physically consistent dynamics under different initial conditions. We argue that this fundamental limitation stems from current models learning motion distributions solely from appearance, while lacking an understanding of the underlying dynamics. In this work, we propose NewtonGen, a framework that integrates data-driven synthesis with learnable physical principles. At its core lies trainable Neural Newtonian Dynamics (NND), which can model and predict a variety of Newtonian motions, thereby injecting latent dynamical constraints into the video generation process. By jointly leveraging data priors and dynamical guidance, NewtonGen enables physically consistent video synthesis with precise parameter control.
中文摘要:现有文本生成视频模型存在物理一致性和可控性不足的问题,牛顿生成框架通过引入可训练的神经牛顿动力学,将物理规律融入生成过程,实现了具备精确参数控制的物理一致性视频合成。
English Summary: Current text-to-video models struggle with physical consistency and controllability, so NewtonGen introduces Neural Newtonian Dynamics to integrate learnable physics principles, enabling physically accurate video generation with precise parameter control.

Authors:Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Title: Quantized Visual Geometry Grounded Transformer
Abstract:
Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves the state-of-the-art results across different benchmarks and bit-width, surpassing the previous state-of-the-art generic quantization method with a great margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7$\times$ memory reduction and 2.5$\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released in https://github.com/wlfeng0509/QuantVGGT.
中文: 本文提出QuantVGGT量化框架,通过双平滑细粒度量化和噪声过滤多样性采样技术,有效解决了十亿级视觉几何变换器量化中的重尾分布和校准稳定性问题,在保持98%以上重建精度的同时实现了3.7倍内存压缩和2.5倍推理加速。
English: This paper introduces QuantVGGT, a novel quantization framework that addresses the challenges of compressing billion-scale Visual Geometry Grounded Transformers through dual-smoothed fine-grained quantization and noise-filtered diverse sampling, achieving significant memory reduction and acceleration while maintaining high reconstruction accuracy.

Authors:Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, Yongjun Xu
Title: Quantized Visual Geometry Grounded Transformer
Abstract:
Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have made remarkable progress with the use of large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has become a common practice for compressing and accelerating models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first Quantization framework for VGGTs, namely QuantVGGT. This mainly relies on two technical contributions: First, we introduce Dual-Smoothed Fine-Grained Quantization, which integrates pre-global Hadamard rotation and post-local channel smoothing to mitigate heavy-tailed distributions and inter-channel variance robustly. Second, we design Noise-Filtered Diverse Sampling, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves the state-of-the-art results across different benchmarks and bit-width, surpassing the previous state-of-the-art generic quantization method with a great margin. We highlight that our 4-bit QuantVGGT can deliver a 3.7$\times$ memory reduction and 2.5$\times$ acceleration in real-hardware inference, while maintaining reconstruction accuracy above 98\% of its full-precision counterpart. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios. Our code is released in https://github.com/wlfeng0509/QuantVGGT.
中文: 本文提出QuantVGGT量化框架,通过双平滑细粒度量化和噪声过滤多样性采样技术,有效解决了十亿级视觉几何变换器量化中的重尾分布和校准稳定性问题,在保持98%以上重建精度的同时实现了3.7倍内存压缩和2.5倍推理加速。
English: This paper introduces QuantVGGT, a novel quantization framework that addresses the challenges of compressing billion-scale Visual Geometry Grounded Transformers through dual-smoothed fine-grained quantization and noise-filtered diverse sampling, achieving significant memory reduction and acceleration while maintaining high reconstruction accuracy.

Authors:Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu
Title: MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Abstract:
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
中文摘要:本研究针对多模态推理模型的局限性,提出方差感知采样方法以稳定强化学习优化,并发布了大规模精选数据集及可复现的训练资源。
English Summary: This research addresses limitations in multimodal reasoning models by introducing Variance-Aware Sampling to stabilize reinforcement learning optimization and releasing large-scale curated datasets with reproducible training resources.

Authors:Zijian Shao, Haiyang Shen, Mugeng Liu, Gecheng Fu, Yaoqi Guo, Yanfeng Wang, Yun Ma
Title: Grounding AI Explanations in Experience: A Reflective Cognitive Architecture for Clinical Decision Support
Abstract:
Effective disease prediction in modern healthcare demands the twin goals of high accuracy and transparent, clinically meaningful explanations. Existing machine learning and large language model (LLM) based approaches often struggle to balance these goals. Many models yield accurate but unclear statistical outputs, while others generate fluent but statistically unsupported narratives, often undermining both the validity of the explanation and the predictive accuracy itself. This shortcoming comes from a shallow interaction with the data, preventing the development of a deep, detailed understanding similar to a human expert's. We argue that high accuracy and high-quality explanations are not separate objectives but are mutually reinforcing outcomes of a model that develops a deep, direct understanding of the data. To achieve this, we propose the Reflective Cognitive Architecture (RCA), a novel framework that coordinates multiple LLMs to learn from direct experience. RCA features an iterative rule refinement mechanism that improves its logic from prediction errors and a distribution-aware rules check mechanism that bases its reasoning in the dataset's global statistics. By using predictive accuracy as a signal to drive deeper comprehension, RCA builds a strong internal model of the data. We evaluated RCA on one private and two public datasets against 22 baselines. The results demonstrate that RCA not only achieves state-of-the-art accuracy and robustness with a relative improvement of up to 40\% over the baseline but, more importantly, leverages this deep understanding to excel in generating explanations that are clear, logical, evidence-based, and balanced, highlighting its potential for creating genuinely trustworthy clinical decision support systems. The code is available at \https://github.com/ssssszj/RCA.
Chinese: 反射认知架构(RCA)是一种新颖框架,通过协调多个大语言模型,利用迭代规则优化和分布感知推理机制,在实现顶尖预测精度的同时生成清晰可信的临床解释,为构建真正可靠的医疗决策系统提供了突破性方案。
English: The Reflective Cognitive Architecture (RCA) is a novel framework that coordinates multiple LLMs to achieve both state-of-the-art predictive accuracy and high-quality, evidence-based explanations by developing a deep understanding of data through iterative rule refinement and distribution-aware reasoning.

Authors:Xinyu Liu, Guolei Sun, Cheng Wang, Yixuan Yuan, Ender Konukoglu
Title: MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation
Abstract:
High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.
中文:提出的MedVSR框架通过跨状态空间传播解决特征对齐问题,并结合内部状态空间重建增强组织结构,有效应对医学视频超分辨率中的独特挑战,在多种医疗场景中展现出卓越性能。
English: The proposed MedVSR framework addresses unique challenges in medical video super-resolution, such as alignment difficulties and artifacts, through Cross State-Space Propagation for feature alignment and Inner State-Space Reconstruction for enhancing tissue structures, demonstrating superior performance across diverse medical scenarios.

Authors:Babak Salamat, Dominik Mattern, Sebastian-Sven Olzem, Gerhard Elsbacher, Christian Seidel, Andrea M. Tonello
Title: \LARGE GMP$^{3}$: Learning-Driven, Bellman-Guided Trajectory Planning for UAVs in Real-Time on SE(3)
Abstract:
We propose $\text{GMP}^{3}$, a multiphase global path planning framework that generates dynamically feasible three-dimensional trajectories for unmanned aerial vehicles (UAVs) operating in cluttered environments. The framework extends traditional path planning from Euclidean position spaces to the Lie group $\mathrm{SE}(3)$, allowing joint learning of translational motion and rotational dynamics. A modified Bellman-based operator is introduced to support reinforcement learning (RL) policy updates while leveraging prior trajectory information for improved convergence. $\text{GMP}^{3}$ is designed as a distributed framework in which agents influence each other and share policy information along the trajectory: each agent refines its assigned segment and shares with its neighbors via a consensus-based scheme, enabling cooperative policy updates and convergence toward a path shaped globally even under kinematic constraints. We also propose DroneManager, a modular ground control software that interfaces the planner with real UAV platforms via the MAVLink protocol, supporting real-time deployment and feedback. Simulation studies and indoor flight experiments validate the effectiveness of the proposed method in constrained 3D environments, demonstrating reliable obstacle avoidance and smooth, feasible trajectories across both position and orientation. The open-source implementation is available at https://github.com/Domattee/DroneManager
中文:GMP³框架通过SE(3)李群上的多阶段全局路径规划,使无人机能在复杂环境中生成动态可行的三维轨迹,结合强化学习和分布式智能体协作提升收敛性,并通过DroneManager软件实现实际部署验证。
English: The GMP³ framework enables UAVs to generate dynamically feasible 3D trajectories in cluttered environments through multiphase global path planning on the SE(3) Lie group, incorporating reinforcement learning and distributed agent cooperation for improved convergence and real-world deployment via DroneManager software.

Authors:Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček
Title: humancompatible.train: Implementing Optimization Algorithms for Stochastically-Constrained Stochastic Optimization Problems
Abstract:
There has been a considerable interest in constrained training of deep neural networks (DNNs) recently for applications such as fairness and safety. Several toolkits have been proposed for this task, yet there is still no industry standard. We present humancompatible.train (https://github.com/humancompatible/train), an easily-extendable PyTorch-based Python package for training DNNs with stochastic constraints. We implement multiple previously unimplemented algorithms for stochastically constrained stochastic optimization. We demonstrate the toolkit use by comparing two algorithms on a deep learning task with fairness constraints.
中文: 针对深度神经网络的约束训练在公平性和安全性等应用领域日益受到关注,为此开发了humancompatible.train这一可扩展的PyTorch工具包,它实现了随机约束优化的新算法,并在公平性约束任务中展示了其应用价值。
English: There is growing interest in constrained training of deep neural networks for applications like fairness and safety, leading to the development of humancompatible.train, an extendable PyTorch-based toolkit that implements novel algorithms for stochastically constrained optimization and demonstrates their use in fairness-constrained tasks.

Authors:Benedikt Hoock, Tobias Köppl
Title: Data-driven Neural Networks for Windkessel Parameter Calibration
Abstract:
In this work, we propose a novel method for calibrating Windkessel (WK) parameters in a dimensionally reduced 1D-0D coupled blood flow model. To this end, we design a data-driven neural network (NN)trained on simulated blood pressures in the left brachial artery. Once trained, the NN emulates the pressure pulse waves across the entire simulated domain, i.e., over time, space and varying WK parameters, with negligible error and computational effort. To calibrate the WK parameters on a measured pulse wave, the NN is extended by dummy neurons and retrained only on these. The main objective of this work is to assess the effectiveness of the method in various scenarios -- particularly, when the exact measurement location is unknown or the data are affected by noise.
中文: 本研究提出了一种基于神经网络的创新方法,用于在血流模型中高效标定Windkessel参数,并验证了该方法在测量位置不确定和数据含噪声情况下的强健性。
English: This study introduces a neural network-based method for efficiently calibrating Windkessel parameters in blood flow models, demonstrating its robustness under uncertain measurement locations and noisy data conditions.

Authors:Benedikt Hoock, Tobias Köppl
Title: Data-driven Neural Networks for Windkessel Parameter Calibration
Abstract:
In this work, we propose a novel method for calibrating Windkessel (WK) parameters in a dimensionally reduced 1D-0D coupled blood flow model. To this end, we design a data-driven neural network (NN)trained on simulated blood pressures in the left brachial artery. Once trained, the NN emulates the pressure pulse waves across the entire simulated domain, i.e., over time, space and varying WK parameters, with negligible error and computational effort. To calibrate the WK parameters on a measured pulse wave, the NN is extended by dummy neurons and retrained only on these. The main objective of this work is to assess the effectiveness of the method in various scenarios -- particularly, when the exact measurement location is unknown or the data are affected by noise.
中文: 本研究提出了一种基于神经网络的创新方法,用于在血流模型中高效标定Windkessel参数,并验证了该方法在测量位置不确定和数据含噪声情况下的强健性。
English: This study introduces a neural network-based method for efficiently calibrating Windkessel parameters in blood flow models, demonstrating its robustness under uncertain measurement locations and noisy data conditions.

Authors:Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen
Title: A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Abstract:
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: \faGithub \href{https://github.com/KaiyangWan/InfoQA}{InfoQA}.
中文摘要:该研究揭示了单次处理大语言模型在多跳问答中的能力瓶颈,提出了名为InfoQA的多轮调用框架,通过能力感知的任务分解保持单步准确性,并在高难度基准测试中实现了稳定性能提升。
English Summary: The study identifies a capacity bottleneck in single-pass LLMs for multi-hop question answering, proposing a multi-call framework called InfoQA that maintains accuracy through capacity-aware task decomposition and achieves robust performance on a challenging benchmark.

Authors:Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, Di Jin
Title: Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning
Abstract:
Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden "tool tax" of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3\% accuracy -- the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5\% and agent steps by 43.7\%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85\% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.
中文: 该框架通过融合隐式检索与结构化协作,克服了大型语言模型中显式检索和均匀聚合的低效问题,在显著降低计算成本的同时实现了最优准确率。
English: This framework overcomes the inefficiencies of explicit retrieval and uniform aggregation in LLMs by integrating implicit retrieval with structured collaboration, achieving state-of-the-art accuracy while significantly reducing computational costs.

Authors:Jacob Fein-Ashley, Dhruv Parikh, Rajgopal Kannan, Viktor Prasanna
Title: Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say
Abstract:
Open-source Large Language Models (LLMs) increasingly specialize by domain (e.g., math, code, general reasoning), motivating systems that leverage complementary strengths across models. Prior multi-LLM approaches either (i) route a query to one or a few experts and generate independently, (ii) aggregate outputs from each model via costly multi-turn exchanges, or (iii) fuse weights into a single model-typically requiring architectural homogeneity. We introduce Mixture of Thoughts (MoT), a simple method for latent-level collaboration among heterogeneous experts under a global routing scheme. For each query, a lightweight router selects top-$K$ experts and designates a primary expert; uniformly placed interaction layers project hidden states into a shared latent space where the primary expert performs cross-attention over its active (selected) peers. Pre-trained experts remain frozen; only the router and the lightweight interaction layers are trained with a novel joint training objective that improves both the expert selection and inter-expert collaboration. Across five in-distribution (ID) and three out-of-distribution (OOD) benchmarks, MoT surpasses the current routing and aggregation-based state-of-the-art, Avengers, by $+0.38\%$ and $+2.92\%$, respectively. Further, MoT significantly outperforms the best-performing single model. It achieves this with single-pass inference, runtime comparable to routing baselines, and none of the overheads of iterative aggregation. MoT offers a simple latent-space mechanism for combining heterogeneous LLMs, a practical step toward broader multi-LLM collaboration. Our code is publicly available at https://github.com/jacobfa/mot.
中文:Mixture of Thoughts (MoT)方法通过轻量级路由选择专家并在共享潜在空间中进行交互,实现了异构大语言模型的高效协作,以单次推理和低开销超越了现有最优方法。
English: The Mixture of Thoughts (MoT) method enables efficient collaboration among diverse large language models by using a lightweight router to select experts and facilitate latent-level interactions in a shared space, achieving superior performance over existing approaches with single-pass inference and minimal overhead.

Authors:Killian Steunou, Sigurd Saue, Théo Druilhe
Title: Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
Abstract:
Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.
中文: 本研究证明,采用稀疏主成分分析(SPCA)作为防御机制,通过稀疏投影降低输入敏感性,能有效提升神经网络对抗攻击的鲁棒性,在保持竞争力的准确率同时,在白盒与黑盒攻击下均优于标准PCA方法。
English: This study demonstrates that using sparse principal component analysis (SPCA) as a defense mechanism enhances neural network robustness against adversarial attacks by reducing input sensitivity through sparser projections, maintaining competitive accuracy while outperforming standard PCA under various attack scenarios.

Authors:Killian Steunou, Théo Druilhe, Sigurd Saue
Title: Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
Abstract:
Deep neural networks perform remarkably well on image classification tasks but remain vulnerable to carefully crafted adversarial perturbations. This work revisits linear dimensionality reduction as a simple, data-adapted defense. We empirically compare standard Principal Component Analysis (PCA) with its sparse variant (SPCA) as front-end feature extractors for downstream classifiers, and we complement these experiments with a theoretical analysis. On the theory side, we derive exact robustness certificates for linear heads applied to SPCA features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink, where $W$ is the projection and $u$ the head weights. We further show that for general (non-linear) heads, sparsity reduces operator-norm bounds through a Lipschitz composition argument, predicting lower input sensitivity. Empirically, with a small non-linear network after the projection, SPCA consistently degrades more gracefully than PCA under strong white-box and black-box attacks while maintaining competitive clean accuracy. Taken together, the theory identifies the mechanism (sparser projections reduce adversarial leverage) and the experiments verify that this benefit persists beyond the linear setting. Our code is available at https://github.com/killian31/SPCARobustness.
中文: 本研究证明,采用稀疏主成分分析(SPCA)作为防御机制,通过稀疏投影降低输入敏感性,能有效提升神经网络对抗攻击的鲁棒性,在保持竞争力的准确率同时,在白盒与黑盒攻击下均优于标准PCA方法。
English: This study demonstrates that using sparse principal component analysis (SPCA) as a defense mechanism enhances neural network robustness against adversarial attacks by reducing input sensitivity through sparser projections, maintaining competitive accuracy while outperforming standard PCA under various attack scenarios.

Authors:Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
Title: TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Abstract:
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C\neq A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
中文: 大型语言模型作为自动评估器时存在评分比较和成对传递性不一致的问题,TrustJudge通过概率框架有效减少了这些不一致性,并提高了评估准确性。
English: The adoption of LLMs as automated evaluators reveals critical inconsistencies in current frameworks, which TrustJudge addresses through a probabilistic approach that reduces score-comparison and pairwise transitivity inconsistencies while improving evaluation accuracy.

Authors:Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
Title: TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Abstract:
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C\neq A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
中文: 大型语言模型作为自动评估器时存在评分比较和成对传递性不一致的问题,TrustJudge通过概率框架有效减少了这些不一致性,并提高了评估准确性。
English: The adoption of LLMs as automated evaluators reveals critical inconsistencies in current frameworks, which TrustJudge addresses through a probabilistic approach that reduces score-comparison and pairwise transitivity inconsistencies while improving evaluation accuracy.

Authors:Suaiba Amina Salahuddin, Teresa Dorszewski, Marit Almenning Martiniussen, Tone Hovda, Antonio Portaluri, Solveig Thrun, Michael Kampffmeyer, Elisabeth Wetzer, Kristoffer Wickstrøm, Robert Jenssen
Title: Mammo-CLIP Dissect: A Framework for Analysing Mammography Concepts in Vision-Language Models
Abstract:
Understanding what deep learning (DL) models learn is essential for the safe deployment of artificial intelligence (AI) in clinical settings. While previous work has focused on pixel-based explainability methods, less attention has been paid to the textual concepts learned by these models, which may better reflect the reasoning used by clinicians. We introduce Mammo-CLIP Dissect, the first concept-based explainability framework for systematically dissecting DL vision models trained for mammography. Leveraging a mammography-specific vision-language model (Mammo-CLIP) as a "dissector," our approach labels neurons at specified layers with human-interpretable textual concepts and quantifies their alignment to domain knowledge. Using Mammo-CLIP Dissect, we investigate three key questions: (1) how concept learning differs between DL vision models trained on general image datasets versus mammography-specific datasets; (2) how fine-tuning for downstream mammography tasks affects concept specialisation; and (3) which mammography-relevant concepts remain underrepresented. We show that models trained on mammography data capture more clinically relevant concepts and align more closely with radiologists' workflows than models not trained on mammography data. Fine-tuning for task-specific classification enhances the capture of certain concept categories (e.g., benign calcifications) but can reduce coverage of others (e.g., density-related features), indicating a trade-off between specialisation and generalisation. Our findings show that Mammo-CLIP Dissect provides insights into how convolutional neural networks (CNNs) capture mammography-specific knowledge. By comparing models across training data and fine-tuning regimes, we reveal how domain-specific training and task-specific adaptation shape concept learning. Code and concept set are available: https://github.com/Suaiba/Mammo-CLIP-Dissect.
Chinese: Mammo-CLIP Dissect 提出了一种基于概念的可解释性框架,揭示了经过乳腺摄影数据训练的深度学习模型比通用模型更能捕捉临床相关概念,同时强调了在微调过程中专业化与泛化之间的权衡。
English: Mammo-CLIP Dissect introduces a concept-based explainability framework that reveals how deep learning models trained on mammography data capture clinically relevant concepts more effectively than general models, while highlighting the trade-off between specialization and generalization during fine-tuning.

Authors:Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu
Title: ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
Abstract:
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems in large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
中文: ScaleDiff是一种高效且成本低廉的流程,通过自适应思维模型筛选现有数据集中的难题并训练专门生成器,无需昂贵资源即可大规模创建高难度数学问题,显著提升模型在复杂推理任务中的表现。
English: ScaleDiff is a cost-effective pipeline that automates the creation of challenging mathematical problems by filtering existing datasets with an adaptive thinking model and training a specialized generator, significantly boosting model performance on difficult benchmarks without expensive resources.

Authors:Zhen Liu, Yongtao Zhang, Shaobo Ren, Yuxin You
Title: Structure-Attribute Transformations with Markov Chain Boost Graph Domain Adaptation
Abstract:
Graph domain adaptation has gained significant attention in label-scarce scenarios across different graph domains. Traditional approaches to graph domain adaptation primarily focus on transforming node attributes over raw graph structures and aligning the distributions of the transformed node features across networks. However, these methods often struggle with the underlying structural heterogeneity between distinct graph domains, which leads to suboptimal distribution alignment. To address this limitation, we propose Structure-Attribute Transformation with Markov Chain (SATMC), a novel framework that sequentially aligns distributions across networks via both graph structure and attribute transformations. To mitigate the negative influence of domain-private information and further enhance the model's generalization, SATMC introduces a private domain information reduction mechanism and an empirical Wasserstein distance. Theoretical proofs suggest that SATMC can achieve a tighter error bound for cross-network node classification compared to existing graph domain adaptation methods. Extensive experiments on nine pairs of publicly available cross-domain datasets show that SATMC outperforms state-of-the-art methods in the cross-network node classification task. The code is available at https://github.com/GiantZhangYT/SATMC.
Chinese: SATMC框架通过结构和属性转换对齐分布,解决了图域适应中的结构异质性问题,在跨网络节点分类任务中表现优异。
English: The SATMC framework addresses structural heterogeneity in graph domain adaptation by aligning distributions through structure and attribute transformations, achieving superior performance in cross-network node classification.

Authors:Jiahao Huo, Shuliang Liu, Bin Wang, Junyan Zhang, Yibo Yan, Aiwei Liu, Xuming Hu, Mingxun Zhou
Title: PMark: Towards Robust and Distortion-free Semantic-level Watermarking with Channel Constraints
Abstract:
Semantic-level watermarking (SWM) for large language models (LLMs) enhances watermarking robustness against text modifications and paraphrasing attacks by treating the sentence as the fundamental unit. However, existing methods still lack strong theoretical guarantees of robustness, and reject-sampling-based generation often introduces significant distribution distortions compared with unwatermarked outputs. In this work, we introduce a new theoretical framework on SWM through the concept of proxy functions (PFs) $\unicode{x2013}$ functions that map sentences to scalar values. Building on this framework, we propose PMark, a simple yet powerful SWM method that estimates the PF median for the next sentence dynamically through sampling while enforcing multiple PF constraints (which we call channels) to strengthen watermark evidence. Equipped with solid theoretical guarantees, PMark achieves the desired distortion-free property and improves the robustness against paraphrasing-style attacks. We also provide an empirically optimized version that further removes the requirement for dynamical median estimation for better sampling efficiency. Experimental results show that PMark consistently outperforms existing SWM baselines in both text quality and robustness, offering a more effective paradigm for detecting machine-generated text. Our code will be released at [this URL](https://github.com/PMark-repo/PMark).
Chinese: PMark通过代理函数理论框架提出了一种无失真的语义水印方法,增强了抗转述攻击的鲁棒性,并在文本质量和检测效果上优于现有基准方法。
English: PMark introduces a theoretical framework using proxy functions to create a distortion-free semantic watermarking method for LLMs, enhancing robustness against paraphrasing attacks and outperforming existing baselines in text quality and detection effectiveness.

Authors:Songyue Cai, Zongqian Wu, Yujie Mo, Liang Peng, Ping Hu, Xiaoshuang Shi, Xiaofeng Zhu
Title: Background Prompt for Few-Shot Out-of-Distribution Detection
Abstract:
Existing foreground-background (FG-BG) decomposition methods for the few-shot out-of-distribution (FS-OOD) detection often suffer from low robustness due to over-reliance on the local class similarity and a fixed background patch extraction strategy. To address these challenges, we propose a new FG-BG decomposition framework, namely Mambo, for FS-OOD detection. Specifically, we propose to first learn a background prompt to obtain the local background similarity containing both the background and image semantic information, and then refine the local background similarity using the local class similarity. As a result, we use both the refined local background similarity and the local class similarity to conduct background extraction, reducing the dependence of the local class similarity in previous methods. Furthermore, we propose the patch self-calibrated tuning to consider the sample diversity to flexibly select numbers of background patches for different samples, and thus exploring the issue of fixed background extraction strategies in previous methods. Extensive experiments on real-world datasets demonstrate that our proposed Mambo achieves the best performance, compared to SOTA methods in terms of OOD detection and near OOD detection setting. The source code will be released at https://github.com/YuzunoKawori/Mambo.
中文:提出的Mambo框架通过结合精炼的局部背景相似性与类别相似性,并采用补丁自校准调整实现自适应背景块选择,显著提升了少样本分布外检测的鲁棒性和准确性,优于现有方法。
English: The proposed Mambo framework enhances few-shot out-of-distribution detection by integrating refined local background similarity with class similarity and employing patch self-calibrated tuning for adaptive background patch selection, outperforming existing methods in robustness and accuracy.

Authors:Junu Kim, Xiao Liu, Zhenghao Lin, Lei Ji, Yeyun Gong, Edward Choi
Title: Behind RoPE: How Does Causal Mask Encode Positional Information?
Abstract:
While explicit positional encodings such as RoPE are a primary source of positional information in Transformer decoders, the causal mask also provides positional information. In this work, we prove that the causal mask can induce position-dependent patterns in attention scores, even without parameters or causal dependency in the input. Our theoretical analysis indicates that the induced attention pattern tends to favor nearby query-key pairs, mirroring the behavior of common positional encodings. Empirical analysis confirms that trained models exhibit the same behavior, with learned parameters further amplifying these patterns. Notably, we found that the interaction of causal mask and RoPE distorts RoPE's relative attention score patterns into non-relative ones. We consistently observed this effect in modern large language models, suggesting the importance of considering the causal mask as a source of positional information alongside explicit positional encodings.
中文摘要:因果掩码在Transformer解码器中能独立产生偏向局部交互的位置相关注意力模式,其与RoPE等显式位置编码的交互会扭曲相对注意力机制,表明必须将因果掩码视为与显式位置编码同等重要的位置信息来源。
English Summary: The causal mask in Transformer decoders inherently creates position-dependent attention patterns that favor local interactions, and its interaction with explicit positional encodings like RoPE distorts relative attention into non-relative patterns, highlighting the need to treat causal masks as significant positional information sources.

Authors:Rubaiyat Tasnim Chowdhury, Nayan Bala, Ronojoy Roy, Tarek Mahmud
Title: BactoBot: A Low-Cost, Bacteria-Inspired Soft Underwater Robot for Marine Exploration
Abstract:
Traditional rigid underwater vehicles pose risks to delicate marine ecosystems. This paper presents BactoBot, a low-cost, soft underwater robot designed for safe and gentle marine exploration. Inspired by bacterial flagellar propulsion, BactoBot features 12 flexible, silicone-based arms arranged on a 3D-printed dodecahedral frame. The design provides inherent compliance, redundancy, and the potential for omnidirectional movement. The prototype was fabricated using accessible DIY methods, including food-grade silicone molding, 3D printing, and off-the-shelf microcontrollers. Waterproofing and buoyancy calibration protocols were developed, and the robot was successfully tested in a controlled water tank, demonstrating forward motion and turning. The results validate the feasibility of replicating complex biological locomotion at low cost. The project lays a foundation for environmentally conscious robotic tools, particularly for marine science in resource-constrained settings, and identifies pathways toward autonomous operation and field deployment.
中文: 本文介绍了BactoBot,一种受细菌鞭毛启发的低成本软体水下机器人,采用柔性硅胶臂设计,旨在实现安全的海洋探索,并已在受控环境中成功完成运动测试。
English: This paper introduces BactoBot, an affordable soft underwater robot inspired by bacterial propulsion, designed with flexible silicone arms for safe marine exploration and successfully tested for movement in controlled environments.

Authors:Sarmistha Das, R E Zera Marveen Lyngkhoi, Sriparna Saha, Alka Maurya
Title: Unlocking Financial Insights: An advanced Multimodal Summarization with Multimodal Output Framework for Financial Advisory Videos
Abstract:
The dynamic propagation of social media has broadened the reach of financial advisory content through podcast videos, yet extracting insights from lengthy, multimodal segments (30-40 minutes) remains challenging. We introduce FASTER (Financial Advisory Summariser with Textual Embedded Relevant images), a modular framework that tackles three key challenges: (1) extracting modality-specific features, (2) producing optimized, concise summaries, and (3) aligning visual keyframes with associated textual points. FASTER employs BLIP for semantic visual descriptions, OCR for textual patterns, and Whisper-based transcription with Speaker diarization as BOS features. A modified Direct Preference Optimization (DPO)-based loss function, equipped with BOS-specific fact-checking, ensures precision, relevance, and factual consistency against the human-aligned summary. A ranker-based retrieval mechanism further aligns keyframes with summarized content, enhancing interpretability and cross-modal coherence. To acknowledge data resource scarcity, we introduce Fin-APT, a dataset comprising 470 publicly accessible financial advisory pep-talk videos for robust multimodal research. Comprehensive cross-domain experiments confirm FASTER's strong performance, robustness, and generalizability when compared to Large Language Models (LLMs) and Vision-Language Models (VLMs). By establishing a new standard for multimodal summarization, FASTER makes financial advisory content more accessible and actionable, thereby opening new avenues for research. The dataset and code are available at: https://github.com/sarmistha-D/FASTER
中文摘要:FASTER框架通过整合多模态特征提取、优化摘要生成和视觉文本对齐,解决了冗长金融咨询视频的摘要难题,其综合测试表现优于现有模型。
English Summary: The FASTER framework addresses the challenge of summarizing lengthy financial advisory videos by integrating multimodal feature extraction, optimized summarization, and visual-text alignment, demonstrating superior performance over existing models through comprehensive testing.

Authors:Wenhao Tang, Heng Fang, Ge Wu, Xiang Li, Ming-Ming Cheng
Title: Revisiting Data Challenges of Computational Pathology: A Pack-based Multiple Instance Learning Framework
Abstract:
Computational pathology (CPath) digitizes pathology slides into whole slide images (WSIs), enabling analysis for critical healthcare tasks such as cancer diagnosis and prognosis. However, WSIs possess extremely long sequence lengths (up to 200K), significant length variations (from 200 to 200K), and limited supervision. These extreme variations in sequence length lead to high data heterogeneity and redundancy. Conventional methods often compromise on training efficiency and optimization to preserve such heterogeneity under limited supervision. To comprehensively address these challenges, we propose a pack-based MIL framework. It packs multiple sampled, variable-length feature sequences into fixed-length ones, enabling batched training while preserving data heterogeneity. Moreover, we introduce a residual branch that composes discarded features from multiple slides into a hyperslide which is trained with tailored labels. It offers multi-slide supervision while mitigating feature loss from sampling. Meanwhile, an attention-driven downsampler is introduced to compress features in both branches to reduce redundancy. By alleviating these challenges, our approach achieves an accuracy improvement of up to 8% while using only 12% of the training time in the PANDA(UNI). Extensive experiments demonstrate that focusing data challenges in CPath holds significant potential in the era of foundation models. The code is https://github.com/FangHeng/PackMIL
中文: 该研究提出的基于打包的多示例学习框架通过将可变长度特征序列打包为固定长度实现高效批量训练,同时引入残差分支和注意力下采样器来增强监督并减少冗余,在仅用12%训练时间的情况下实现了最高8%的准确率提升。
English: The proposed pack-based MIL framework addresses computational pathology challenges by packing variable-length feature sequences into fixed-length ones for efficient batched training, while introducing a residual branch and attention-driven downsampler to enhance supervision and reduce redundancy, achieving up to 8% accuracy improvement with only 12% training time.

Authors:Kairui Fu, Tao Zhang, Shuwen Xiao, Ziyang Wang, Xinming Zhang, Chenchi Zhang, Yuliang Yan, Junjun Zheng, Yu Li, Zhihong Chen, Jian Wu, Xiangheng Kong, Shengyu Zhang, Kun Kuang, Yuning Jiang, Bo Zheng
Title: FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets
Abstract:
Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, which typically rely on costly GR training for evaluation, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance the SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on our platform, which serves over 300 million users daily, reveals a 0.35% increase in transaction count, highlighting the practical impact of our method. Regarding the expensive SID validation accompanied by the full training of GRs, we propose two novel metrics of SID that correlate positively with recommendation performance, enabling convenient evaluations without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half. The code and data are available at https://github.com/selous123/al_sid.
中文:FORGE基准通过提供大规模工业数据集和优化策略,解决了生成式检索中语义标识符面临的关键挑战,改进了SID构建,实现了无需完整训练的高效评估,并加速了实际应用中的在线收敛。
English: The FORGE benchmark addresses key challenges in semantic identifiers for generative retrieval by providing a large-scale industrial dataset and optimization strategies, which improve SID construction, enable efficient evaluation without full training, and accelerate online convergence in real-world applications.

Authors:Kairui Fu, Tao Zhang, Shuwen Xiao, Ziyang Wang, Xinming Zhang, Chenchi Zhang, Yuliang Yan, Junjun Zheng, Yu Li, Zhihong Chen, Jian Wu, Xiangheng Kong, Shengyu Zhang, Kun Kuang, Yuning Jiang, Bo Zheng
Title: FORGE: Forming Semantic Identifiers for Generative Retrieval in Industrial Datasets
Abstract:
Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) due to their meaningful semantic discriminability. However, current research on SIDs faces three main challenges: (1) the absence of large-scale public datasets with multimodal features, (2) limited investigation into optimization strategies for SID generation, which typically rely on costly GR training for evaluation, and (3) slow online convergence in industrial deployment. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieR in Generative rEtrieval with industrial datasets. Specifically, FORGE is equipped with a dataset comprising 14 billion user interactions and multimodal features of 250 million items sampled from Taobao, one of the biggest e-commerce platforms in China. Leveraging this dataset, FORGE explores several optimizations to enhance the SID construction and validates their effectiveness via offline experiments across different settings and tasks. Further online analysis conducted on the "Guess You Like" section of Taobao's homepage shows a 0.35% increase in transaction count, highlighting the practical impact of our method. Regarding the expensive SID validation accompanied by the full training of GRs, we propose two novel metrics of SID that correlate positively with recommendation performance, enabling convenient evaluations without any GR training. For real-world applications, FORGE introduces an offline pretraining schema that reduces online convergence by half. The code and data are available at https://github.com/selous123/al_sid.
中文:FORGE基准通过提供大规模工业数据集和优化策略,解决了生成式检索中语义标识符面临的关键挑战,改进了SID构建,实现了无需完整训练的高效评估,并加速了实际应用中的在线收敛。
English: The FORGE benchmark addresses key challenges in semantic identifiers for generative retrieval by providing a large-scale industrial dataset and optimization strategies, which improve SID construction, enable efficient evaluation without full training, and accelerate online convergence in real-world applications.

Authors:Zhifei Li, Feng Qiu, Yiran Wang, Yujing Xia, Kui Xiao, Miao Zhang, Yan Zhang
Title: Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
Abstract:
Visual Question Answering (VQA) presents a unique challenge by requiring models to understand and reason about visual content to answer questions accurately. Existing VQA models often struggle with biases introduced by the training data, leading to over-reliance on superficial patterns and inadequate generalization to diverse questions and images. This paper presents a novel model, IOG-VQA, which integrates Object Interaction Self-Attention and GAN-Based Debiasing to enhance VQA model performance. The self-attention mechanism allows our model to capture complex interactions between objects within an image, providing a more comprehensive understanding of the visual context. Meanwhile, the GAN-based debiasing framework generates unbiased data distributions, helping the model to learn more robust and generalizable features. By leveraging these two components, IOG-VQA effectively combines visual and textual information to address the inherent biases in VQA datasets. Extensive experiments on the VQA-CP v1 and VQA-CP v2 datasets demonstrate that our model shows excellent performance compared with the existing methods, particularly in handling biased and imbalanced data distributions highlighting the importance of addressing both object interactions and dataset biases in advancing VQA tasks. Our code is available at https://github.com/HubuKG/IOG-VQA.
Chinese Summary: IOG-VQA模型通过结合物体交互自注意力机制和基于GAN的去偏方法,有效提升了视觉问答性能,在标准数据集上展现出卓越的泛化能力和抗偏置特性。
English Summary: The IOG-VQA model enhances Visual Question Answering by integrating object interaction self-attention for better visual context understanding and GAN-based debiasing to mitigate dataset biases, achieving superior performance on benchmark datasets.

Authors:Yan Zhang, Jiaqing Lin, Miao Zhang, Kui Xiao, Xiaoju Hou, Yue Zhao, Zhifei Li
Title: SCRA-VQA: Summarized Caption-Rerank for Augmented Large Language Models in Visual Question Answering
Abstract:
Acquiring high-quality knowledge is a central focus in Knowledge-Based Visual Question Answering (KB-VQA). Recent methods use large language models (LLMs) as knowledge engines for answering. These methods generally employ image captions as visual text descriptions to assist LLMs in interpreting images. However, the captions frequently include excessive noise irrelevant to the question, and LLMs generally do not comprehend VQA tasks, limiting their reasoning capabilities. To address this issue, we propose the Summarized Caption-Rerank Augmented VQA (SCRA-VQA), which employs a pre-trained visual language model to convert images into captions. Moreover, SCRA-VQA generates contextual examples for the captions while simultaneously summarizing and reordering them to exclude unrelated information. The caption-rerank process enables LLMs to understand the image information and questions better, thus enhancing the model's reasoning ability and task adaptability without expensive end-to-end training. Based on an LLM with 6.7B parameters, SCRA-VQA performs excellently on two challenging knowledge-based VQA datasets: OK-VQA and A-OKVQA, achieving accuracies of 38.8% and 34.6%. Our code is available at https://github.com/HubuKG/SCRA-VQA.
中文:SCRA-VQA通过总结和重排图像描述来减少噪声,使大型语言模型能更好地理解图像并提升推理能力,无需昂贵训练即可在知识型视觉问答任务中取得优异表现。
English: SCRA-VQA enhances knowledge-based visual question answering by summarizing and reranking image captions to reduce noise, enabling large language models to better interpret images and improve reasoning without costly training, achieving high accuracy on challenging datasets.

Authors:Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei
Title: StyleBench: Evaluating thinking styles in Large Language Models
Abstract:
The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, including Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD) on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints, we open source the benchmark in https://github.com/JamesJunyuGuo/Style_Bench.
中文: 大型语言模型的有效性取决于推理策略,没有单一风格普遍最优,因为性能因模型规模和任务类型而异,其中搜索类方法在开放性问题中表现突出,而简洁风格在明确任务中显著提升效率。
English: The effectiveness of Large Language Models depends on reasoning strategies, with no single style universally optimal, as performance varies by model scale and task type, where search-based methods excel in open-ended problems and concise styles boost efficiency in well-defined tasks.

Authors:Xiaonan Hu, Xuebing Li, Jinyu Xu, Abdulkadir Duran Adan, Letian Zhou, Xuhui Zhu, Yanan Li, Wei Guo, Shouyang Liu, Wenzhong Liu, Hao Lu
Title: TasselNetV4: A vision foundation model for cross-scene, cross-scale, and cross-species plant counting
Abstract:
Accurate plant counting provides valuable information for agriculture such as crop yield prediction, plant density assessment, and phenotype quantification. Vision-based approaches are currently the mainstream solution. Prior art typically uses a detection or a regression model to count a specific plant. However, plants have biodiversity, and new cultivars are increasingly bred each year. It is almost impossible to exhaust and build all species-dependent counting models. Inspired by class-agnostic counting (CAC) in computer vision, we argue that it is time to rethink the problem formulation of plant counting, from what plants to count to how to count plants. In contrast to most daily objects with spatial and temporal invariance, plants are dynamic, changing with time and space. Their non-rigid structure often leads to worse performance than counting rigid instances like heads and cars such that current CAC and open-world detection models are suboptimal to count plants. In this work, we inherit the vein of the TasselNet plant counting model and introduce a new extension, TasselNetV4, shifting from species-specific counting to cross-species counting. TasselNetV4 marries the local counting idea of TasselNet with the extract-and-match paradigm in CAC. It builds upon a plain vision transformer and incorporates novel multi-branch box-aware local counters used to enhance cross-scale robustness. Two challenging datasets, PAC-105 and PAC-Somalia, are harvested. Extensive experiments against state-of-the-art CAC models show that TasselNetV4 achieves not only superior counting performance but also high efficiency.Our results indicate that TasselNetV4 emerges to be a vision foundation model for cross-scene, cross-scale, and cross-species plant counting.
中文: TasselNetV4通过将植物计数从物种特定模型转向跨物种方法,结合视觉变换器和多分支局部计数器,在多种农业场景中实现了卓越的计数精度与效率。
English: TasselNetV4 advances plant counting by transitioning from species-specific models to a cross-species approach, leveraging a vision transformer and multi-branch local counters to achieve superior accuracy and efficiency across diverse agricultural scenarios.

Authors:Keitaro Sakamoto, Issei Sato
Title: Explaining Grokking and Information Bottleneck through Neural Collapse Emergence
Abstract:
The training dynamics of deep neural networks often defy expectations, even as these models form the foundation of modern machine learning. Two prominent examples are grokking, where test performance improves abruptly long after the training loss has plateaued, and the information bottleneck principle, where models progressively discard input information irrelevant to the prediction task as training proceeds. However, the mechanisms underlying these phenomena and their relations remain poorly understood. In this work, we present a unified explanation of such late-phase phenomena through the lens of neural collapse, which characterizes the geometry of learned representations. We show that the contraction of population within-class variance is a key factor underlying both grokking and information bottleneck, and relate this measure to the neural collapse measure defined on the training set. By analyzing the dynamics of neural collapse, we show that distinct time scales between fitting the training set and the progression of neural collapse account for the behavior of the late-phase phenomena. Finally, we validate our theoretical findings on multiple datasets and architectures.
中文: 本研究通过神经坍缩视角统一解释了训练后期现象如顿悟和信息瓶颈,揭示了类内方差收缩是这些现象的关键机制,并在多个数据集和架构上验证了理论发现。
English: This study provides a unified explanation for late-phase training phenomena like grokking and the information bottleneck through neural collapse, showing that the contraction of within-class variance underlies these behaviors and validating the findings across datasets and architectures.

Authors:Songze Li, Zhiqiang Liu, Zhengke Gui, Huajun Chen, Wen Zhang
Title: Enrich-on-Graph: Query-Graph Alignment for Complex Reasoning with LLM Enriching
Abstract:
Large Language Models (LLMs) exhibit strong reasoning capabilities in complex tasks. However, they still struggle with hallucinations and factual errors in knowledge-intensive scenarios like knowledge graph question answering (KGQA). We attribute this to the semantic gap between structured knowledge graphs (KGs) and unstructured queries, caused by inherent differences in their focuses and structures. Existing methods usually employ resource-intensive, non-scalable workflows reasoning on vanilla KGs, but overlook this gap. To address this challenge, we propose a flexible framework, Enrich-on-Graph (EoG), which leverages LLMs' prior knowledge to enrich KGs, bridge the semantic gap between graphs and queries. EoG enables efficient evidence extraction from KGs for precise and robust reasoning, while ensuring low computational costs, scalability, and adaptability across different methods. Furthermore, we propose three graph quality evaluation metrics to analyze query-graph alignment in KGQA task, supported by theoretical validation of our optimization objectives. Extensive experiments on two KGQA benchmark datasets indicate that EoG can effectively generate high-quality KGs and achieve the state-of-the-art performance. Our code and data are available at https://github.com/zjukg/Enrich-on-Graph.
中文摘要:Enrich-on-Graph框架利用大语言模型的先验知识增强知识图谱,弥合图谱与查询间的语义鸿沟,在知识图谱问答任务中以高效可扩展的方式实现了最优性能。
English Summary: The Enrich-on-Graph framework enhances knowledge graphs using LLMs' prior knowledge to bridge the semantic gap with queries, achieving state-of-the-art performance in KGQA with improved efficiency and scalability.

Authors:Shihua Huang, Yongjie Hou, Longfei Liu, Xuanlong Yu, Xi Shen
Title: Real-Time Object Detection Meets DINOv3
Abstract:
Benefiting from the simplicity and effectiveness of Dense O2O and MAL, DEIM has become the mainstream training framework for real-time DETRs, significantly outperforming the YOLO series. In this work, we extend it with DINOv3 features, resulting in DEIMv2. DEIMv2 spans eight model sizes from X to Atto, covering GPU, edge, and mobile deployment. For the X, L, M, and S variants, we adopt DINOv3-pretrained or distilled backbones and introduce a Spatial Tuning Adapter (STA), which efficiently converts DINOv3's single-scale output into multi-scale features and complements strong semantics with fine-grained details to enhance detection. For ultra-lightweight models (Nano, Pico, Femto, and Atto), we employ HGNetv2 with depth and width pruning to meet strict resource budgets. Together with a simplified decoder and an upgraded Dense O2O, this unified design enables DEIMv2 to achieve a superior performance-cost trade-off across diverse scenarios, establishing new state-of-the-art results. Notably, our largest model, DEIMv2-X, achieves 57.8 AP with only 50.3 million parameters, surpassing prior X-scale models that require over 60 million parameters for just 56.5 AP. On the compact side, DEIMv2-S is the first sub-10 million model (9.71 million) to exceed the 50 AP milestone on COCO, reaching 50.9 AP. Even the ultra-lightweight DEIMv2-Pico, with just 1.5 million parameters, delivers 38.5 AP, matching YOLOv10-Nano (2.3 million) with around 50 percent fewer parameters. Our code and pre-trained models are available at https://github.com/Intellindust-AI-Lab/DEIMv2
中文: DEIMv2通过融合DINOv3特征和空间调谐适配器,在八个模型尺寸上实现了卓越性能,以更少参数创下实时检测新纪录。
English: DEIMv2 enhances the DEIM framework by integrating DINOv3 features and a Spatial Tuning Adapter, achieving superior performance across eight model sizes with state-of-the-art results in real-time detection while reducing parameter counts.

Authors:Chenhui Hu, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Title: Towards Atoms of Large Language Models
Abstract:
The fundamental units of internal representations in large language models (LLMs) remain undefined, limiting further understanding of their mechanisms. Neurons or features are often regarded as such units, yet neurons suffer from polysemy, while features face concerns of unreliable reconstruction and instability. To address this issue, we propose the Atoms Theory, which defines such units as atoms. We introduce the atomic inner product (AIP) to correct representation shifting, formally define atoms, and prove the conditions that atoms satisfy the Restricted Isometry Property (RIP), ensuring stable sparse representations over atom set and linking to compressed sensing. Under stronger conditions, we further establish the uniqueness and exact $\ell_1$ recoverability of the sparse representations, and provide guarantees that single-layer sparse autoencoders (SAEs) with threshold activations can reliably identify the atoms. To validate the Atoms Theory, we train threshold-activated SAEs on Gemma2-2B, Gemma2-9B, and Llama3.1-8B, achieving 99.9% sparse reconstruction across layers on average, and more than 99.8% of atoms satisfy the uniqueness condition, compared to 0.5% for neurons and 68.2% for features, showing that atoms more faithfully capture intrinsic representations of LLMs. Scaling experiments further reveal the link between SAEs size and recovery capacity. Overall, this work systematically introduces and validates Atoms Theory of LLMs, providing a theoretical framework for understanding internal representations and a foundation for mechanistic interpretability. Code available at https://github.com/ChenhuiHu/towards_atoms.
中文: 本文提出原子理论,将原子定义为大语言模型内部表征的基本单元,通过在Gemma2和Llama3.1等模型上的理论证明与实验验证,展示了原子相比神经元和特征具有更优的稳定性与唯一性。
English: This paper introduces the Atoms Theory, defining atoms as the fundamental units of internal representations in large language models and demonstrating their superior stability and uniqueness over neurons and features through theoretical proofs and empirical validation on models like Gemma2 and Llama3.1.

Authors:Hyomin Choi, Heeji Han, Chris Rosewarne, Fabien Racapé
Title: CompressAI-Vision: Open-source software to evaluate compression methods for computer vision tasks
Abstract:
With the increasing use of neural network (NN)-based computer vision applications that process image and video data as input, interest has emerged in video compression technology optimized for computer vision tasks. In fact, given the variety of vision tasks, associated NN models and datasets, a consolidated platform is needed as a common ground to implement and evaluate compression methods optimized for downstream vision tasks. CompressAI-Vision is introduced as a comprehensive evaluation platform where new coding tools compete to efficiently compress the input of vision network while retaining task accuracy in the context of two different inference scenarios: "remote" and "split" inferencing. Our study showcases various use cases of the evaluation platform incorporated with standard codecs (under development) by examining the compression gain on several datasets in terms of bit-rate versus task accuracy. This evaluation platform has been developed as open-source software and is adopted by the Moving Pictures Experts Group (MPEG) for the development the Feature Coding for Machines (FCM) standard. The software is available publicly at https://github.com/InterDigitalInc/CompressAI-Vision.
中文: 随着基于神经网络的计算机视觉应用日益增多,CompressAI-Vision作为一个开源评估平台被推出,用于测试在远程和分离推理场景下保持任务准确性的视频压缩方法,现已被MPEG采纳用于开发FCM标准。
English: With the rise of neural network-based computer vision applications, CompressAI-Vision is introduced as an open-source platform to evaluate video compression methods that maintain task accuracy in remote and split inference scenarios, now adopted by MPEG for the FCM standard.

Authors:Yuxuan Zhou, Xingxing Li, Shengyu Li, Zhuohao Yan, Chunxi Xia, Shaoquan Feng
Title: MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM
Abstract:
Visual SLAM is a cornerstone technique in robotics, autonomous driving and extended reality (XR), yet classical systems often struggle with low-texture environments, scale ambiguity, and degraded performance under challenging visual conditions. Recent advancements in feed-forward neural network-based pointmap regression have demonstrated the potential to recover high-fidelity 3D scene geometry directly from images, leveraging learned spatial priors to overcome limitations of traditional multi-view geometry methods. However, the widely validated advantages of probabilistic multi-sensor information fusion are often discarded in these pipelines. In this work, we propose MASt3R-Fusion,a multi-sensor-assisted visual SLAM framework that tightly integrates feed-forward pointmap regression with complementary sensor information, including inertial measurements and GNSS data. The system introduces Sim(3)-based visualalignment constraints (in the Hessian form) into a universal metric-scale SE(3) factor graph for effective information fusion. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures, enabling real-time pose tracking, metric-scale structure perception and globally consistent mapping. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems. The code will be released open-source to support reproducibility and further research (https://github.com/GREAT-WHU/MASt3R-Fusion).
中文摘要:MASt3R-Fusion创新地将神经网络点云回归与惯性/GNSS数据通过分层因子图紧密融合,构建了能够实时进行姿态跟踪和全局一致建图的多传感器视觉SLAM系统,显著提升了精度与鲁棒性。
English Summary: MASt3R-Fusion is a novel multi-sensor visual SLAM framework that integrates neural pointmap regression with inertial and GNSS data through a hierarchical factor graph, achieving enhanced accuracy and robustness in real-time mapping and pose tracking.

Authors:Yuxuan Zhou, Xingxing Li, Shengyu Li, Zhuohao Yan, Chunxi Xia, Shaoquan Feng
Title: MASt3R-Fusion: Integrating Feed-Forward Visual Model with IMU, GNSS for High-Functionality SLAM
Abstract:
Visual SLAM is a cornerstone technique in robotics, autonomous driving and extended reality (XR), yet classical systems often struggle with low-texture environments, scale ambiguity, and degraded performance under challenging visual conditions. Recent advancements in feed-forward neural network-based pointmap regression have demonstrated the potential to recover high-fidelity 3D scene geometry directly from images, leveraging learned spatial priors to overcome limitations of traditional multi-view geometry methods. However, the widely validated advantages of probabilistic multi-sensor information fusion are often discarded in these pipelines. In this work, we propose MASt3R-Fusion,a multi-sensor-assisted visual SLAM framework that tightly integrates feed-forward pointmap regression with complementary sensor information, including inertial measurements and GNSS data. The system introduces Sim(3)-based visualalignment constraints (in the Hessian form) into a universal metric-scale SE(3) factor graph for effective information fusion. A hierarchical factor graph design is developed, which allows both real-time sliding-window optimization and global optimization with aggressive loop closures, enabling real-time pose tracking, metric-scale structure perception and globally consistent mapping. We evaluate our approach on both public benchmarks and self-collected datasets, demonstrating substantial improvements in accuracy and robustness over existing visual-centered multi-sensor SLAM systems. The code will be released open-source to support reproducibility and further research (https://github.com/GREAT-WHU/MASt3R-Fusion).
中文摘要:MASt3R-Fusion创新地将神经网络点云回归与惯性/GNSS数据通过分层因子图紧密融合,构建了能够实时进行姿态跟踪和全局一致建图的多传感器视觉SLAM系统,显著提升了精度与鲁棒性。
English Summary: MASt3R-Fusion is a novel multi-sensor visual SLAM framework that integrates neural pointmap regression with inertial and GNSS data through a hierarchical factor graph, achieving enhanced accuracy and robustness in real-time mapping and pose tracking.

Authors:Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, Yuguang Fang
Title: Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection
Abstract:
Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). % In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream tasking performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings.The code is available at https://github.com/gy65896/Neptune-X.
中文: Neptune-X框架通过X-to-Maritime生成模型合成多样化海事场景,并采用任务感知样本选择机制,有效提升了海事目标检测的准确性,尤其在具有挑战性的场景中表现突出。
English: Neptune-X is a data-centric generative-selection framework that enhances maritime object detection by synthesizing diverse scenes with its X-to-Maritime model and dynamically selecting task-relevant samples, significantly improving accuracy in challenging conditions.

Authors:Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, Yuguang Fang
Title: Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection
Abstract:
Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream tasking performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.
中文: Neptune-X框架通过X-to-Maritime生成模型合成多样化海事场景,并采用任务感知样本选择机制,有效提升了海事目标检测的准确性,尤其在具有挑战性的场景中表现突出。
English: Neptune-X is a data-centric generative-selection framework that enhances maritime object detection by synthesizing diverse scenes with its X-to-Maritime model and dynamically selecting task-relevant samples, significantly improving accuracy in challenging conditions.

Authors:Zhenshan Zhang, Xueping Zhang, Yechen Wang, Liwei Jin, Ming Li
Title: The Impact of Audio Watermarking on Audio Anti-Spoofing Countermeasures
Abstract:
This paper presents the first study on the impact of audio watermarking on spoofing countermeasures. While anti-spoofing systems are essential for securing speech-based applications, the influence of widely used audio watermarking, originally designed for copyright protection, remains largely unexplored. We construct watermark-augmented training and evaluation datasets, named the Watermark-Spoofing dataset, by applying diverse handcrafted and neural watermarking methods to existing anti-spoofing datasets. Experiments show that watermarking consistently degrades anti-spoofing performance, with higher watermark density correlating with higher Equal Error Rates (EERs). To mitigate this, we propose the Knowledge-Preserving Watermark Learning (KPWL) framework, enabling models to adapt to watermark-induced shifts while preserving their original-domain spoofing detection capability. These findings reveal audio watermarking as a previously overlooked domain shift and establish the first benchmark for developing watermark-resilient anti-spoofing systems. All related protocols are publicly available at https://github.com/Alphawarheads/Watermark_Spoofing.git
中文: 本研究首次揭示音频水印会显著降低反欺骗系统的性能,并提出知识保留水印学习框架,在维持检测能力的同时有效缓解水印带来的负面影响。
English: This study reveals that audio watermarking significantly degrades anti-spoofing performance and proposes a Knowledge-Preserving Watermark Learning framework to mitigate this impact while maintaining detection capabilities.

Authors:Yuan Chiang, Tobias Kreiman, Christine Zhang, Matthew C. Kuner, Elizabeth Weaver, Ishan Amin, Hyunsoo Park, Yunsung Lim, Jihan Kim, Daryl Chrzan, Aron Walsh, Samuel M. Blau, Mark Asta, Aditi S. Krishnapriyan
Title: MLIP Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform
Abstract:
Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing the important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide the next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.
中文: MLIP Arena 是一个新型基准平台,通过物理感知和实际性能指标评估机器学习原子间势,克服现有基准的局限性,为开发更精准高效的模型提供可复现框架。
English: MLIP Arena is a new benchmark platform that evaluates machine learning interatomic potentials through physics awareness and real-world performance metrics, addressing limitations of current benchmarks by providing a reproducible framework for developing more accurate and efficient models.

Authors:Eric Fithian, Kirill Skobelev
Title: DELM: a Python toolkit for Data Extraction with Language Models
Abstract:
Large Language Models (LLMs) have become powerful tools for annotating unstructured data. However, most existing workflows rely on ad hoc scripts, making reproducibility, robustness, and systematic evaluation difficult. To address these challenges, we introduce DELM (Data Extraction with Language Models), an open-source Python toolkit designed for rapid experimental iteration of LLM-based data extraction pipelines and for quantifying the trade-offs between them. DELM minimizes boilerplate code and offers a modular framework with structured outputs, built-in validation, flexible data-loading and scoring strategies, and efficient batch processing. It also includes robust support for working with LLM APIs, featuring retry logic, result caching, detailed cost tracking, and comprehensive configuration management. We showcase DELM's capabilities through two case studies: one featuring a novel prompt optimization algorithm, and another illustrating how DELM quantifies trade-offs between cost and coverage when selecting keywords to decide which paragraphs to pass to an LLM. DELM is available at \href{https://github.com/Center-for-Applied-AI/delm}{\texttt{github.com/Center-for-Applied-AI/delm}}.
中文: DELM是一个开源Python工具包,通过提供模块化框架、内置验证、批处理和API支持来优化基于大语言模型的数据提取流程,同时支持系统评估成本与性能的权衡。
English: DELM is an open-source Python toolkit that streamlines LLM-based data extraction pipelines by providing a modular framework with built-in validation, batch processing, and API support, while enabling systematic evaluation of cost and performance trade-offs.

Authors:Maria Chiper, Radu Tudor Ionescu
Title: Every Character Counts: From Vulnerability to Defense in Phishing Detection
Abstract:
Phishing attacks targeting both organizations and individuals are becoming an increasingly significant threat as technology advances. Current automatic detection methods often lack explainability and robustness in detecting new phishing attacks. In this work, we investigate the effectiveness of character-level deep learning models for phishing detection, which can provide both robustness and interpretability. We evaluate three neural architectures adapted to operate at the character level, namely CharCNN, CharGRU, and CharBiLSTM, on a custom-built email dataset, which combines data from multiple sources. Their performance is analyzed under three scenarios: (i) standard training and testing, (ii) standard training and testing under adversarial attacks, and (iii) training and testing with adversarial examples. Aiming to develop a tool that operates as a browser extension, we test all models under limited computational resources. In this constrained setup, CharGRU proves to be the best-performing model across all scenarios. All models show vulnerability to adversarial attacks, but adversarial training substantially improves their robustness. In addition, by adapting the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to character-level inputs, we are able to visualize which parts of each email influence the decision of each model. Our open-source code and data is released at https://github.com/chipermaria/every-character-counts.
中文: 本研究评估了字符级深度学习模型在钓鱼检测中的效果,发现CharGRU在计算资源受限时表现最佳,尽管所有模型均易受对抗攻击,但对抗训练能显著提升鲁棒性,并通过改进的Grad-CAM技术实现了决策过程的可视化。
English: This study evaluates character-level deep learning models for phishing detection, finding CharGRU most effective under computational constraints while demonstrating vulnerability to adversarial attacks that can be mitigated through adversarial training and model interpretability via Grad-CAM adaptation.

Authors:Xiao Wang, Jia Wang, Yijie Wang, Pengtao Dang, Sha Cao, Chi Zhang
Title: MARS: toward more efficient multi-agent collaboration for LLM reasoning
Abstract:
Large language models (LLMs) have achieved impressive results in natural language understanding, yet their reasoning capabilities remain limited when operating as single agents. Multi-Agent Debate (MAD) has been proposed to address this limitation by enabling collaborative reasoning among multiple models in a round-table debate manner. While effective, MAD introduces substantial computational overhead due to the number of agents involved and the frequent communication required. In this paper, we propose MARS (Multi-Agent Review System), a role-based collaboration framework inspired by the review process. In MARS, an author agent generates an initial solution, reviewer agents provide decisions and comments independently, and a meta-reviewer integrates the feedback to make the final decision and guide further revision. This design enhances reasoning quality while avoiding costly reviewer-to-reviewer interactions, thereby controlling token consumption and inference time. We compared MARS with both MAD and other state-of-the-art reasoning strategies across multiple benchmarks. Extensive experiments with different LLMs show that MARS matches the accuracy of MAD while reducing both token usage and inference time by approximately 50\%. Code is available at https://github.com/xwang97/MARS.
Chinese: MARS框架通过基于角色的评审流程增强大语言模型的推理能力,在保持与多智能体辩论同等准确率的同时,将令牌使用量和推理时间减少约50%。
English: The MARS framework enhances reasoning in large language models through a role-based review process, matching the accuracy of Multi-Agent Debate while cutting token usage and inference time by half.

Authors:Tue Do, Varun Chandrasekaran, Daniel Alabi
Title: Efficiently Attacking Memorization Scores
Abstract:
Influence estimation tools -- such as memorization scores -- are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators. We characterize attacks for producing highly memorized samples as highly sensitive queries in the regime where a trained algorithm is accurate. Our attack (calculating the pseudoinverse of the input) is practical, requiring only black-box access to model outputs and incur modest computational overhead. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulations. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://github.com/tuedo2/MemAttack
中文: 本研究证明基于记忆的影响评估工具易受实际对抗攻击,仅需少量计算成本即可操纵评分,揭示了影响归因系统固有的脆弱性。
English: This study demonstrates that memorization-based influence estimators are vulnerable to practical adversarial attacks, which can manipulate scores with minimal computational cost, revealing inherent fragility in influence-based attribution systems.

Authors:Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Arta Yapeter, Ilya Stanevich, Felipe Perez, Jesse C. Cresswell
Title: Document Summarization with Conformal Importance Guarantees
Abstract:
Automatic summarization systems have advanced rapidly with large language models (LLMs), yet they still lack reliable guarantees on inclusion of critical content in high-stakes domains like healthcare, law, and finance. In this work, we introduce Conformal Importance Summarization, the first framework for importance-preserving summary generation which uses conformal prediction to provide rigorous, distribution-free coverage guarantees. By calibrating thresholds on sentence-level importance scores, we enable extractive document summarization with user-specified coverage and recall rates over critical content. Our method is model-agnostic, requires only a small calibration set, and seamlessly integrates with existing black-box LLMs. Experiments on established summarization benchmarks demonstrate that Conformal Importance Summarization achieves the theoretically assured information coverage rate. Our work suggests that Conformal Importance Summarization can be combined with existing techniques to achieve reliable, controllable automatic summarization, paving the way for safer deployment of AI summarization tools in critical applications. Code is available at https://github.com/layer6ai-labs/conformal-importance-summarization.
中文: 本文提出"保形重要性摘要"框架,通过保形预测为自动摘要系统提供严格的关键内容覆盖保证,可在医疗、法律等高风险领域实现更安全可靠的部署。
English: This paper presents Conformal Importance Summarization, a novel framework that uses conformal prediction to provide rigorous coverage guarantees for preserving critical content in automatic summarization, enabling safer deployment in high-stakes domains.

Authors:Zhe Shen
Title: The First Open-Source Framework for Learning Stability Certificate from Data
Abstract:
Before 2025, no open-source system existed that could learn Lyapunov stability certificates directly from noisy, real-world flight data. No tool could answer the critical question: is this controller still stabilizable-especially when its closed-loop system is a total black box. We broke that boundary. This year, we released the first-ever open-source framework that can learn Lyapunov functions from trajectory data under realistic, noise-corrupted conditions. Unlike statistical anomaly detectors, our method does not merely flag deviations-it directly determines whether the system can still be proven stable. Applied to public data from the 2024 SAS severe turbulence incident, our method revealed that, within just 60 seconds of the aircrafts descent becoming abnormal, no Lyapunov function could be constructed to certify system stability. Moreover, this is the first known data-driven stability-theoretic method ever applied to a civil airliner accident. And our approach works with zero access to the controller logic-a breakthrough for commercial aircraft where control laws are proprietary and opaque. The implementation of the proposed framework is open-sourced and available at: https://github.com/HansOersted/stability
中文: 2025年前,尚无开源系统能从噪声飞行数据中学习李雅普诺夫稳定性证书或评估黑盒系统的控制器可稳性,但今年首个此类框架问世,在2024年SAS湍流事件中,飞机异常下降60秒内即检测到失稳,且无需访问控制器逻辑。
English: Before 2025, no open-source system could learn Lyapunov stability certificates from noisy real-world flight data or assess controller stabilizability for black-box systems, but this year, the first such framework was released, successfully detecting instability within 60 seconds of abnormal descent in the 2024 SAS turbulence incident without requiring controller logic access.

Authors:Haoxuan Li, Zhen Wen, Qiqi Jiang, Chenxiao Li, Yuwei Wu, Yuchen Yang, Yiyao Wang, Xiuqi Huang, Minfeng Zhu, Wei Chen
Title: ConceptViz: A Visual Analytics Approach for Exploring Concepts in Large Language Models
Abstract:
Large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks. Understanding how LLMs internally represent knowledge remains a significant challenge. Despite Sparse Autoencoders (SAEs) have emerged as a promising technique for extracting interpretable features from LLMs, SAE features do not inherently align with human-understandable concepts, making their interpretation cumbersome and labor-intensive. To bridge the gap between SAE features and human concepts, we present ConceptViz, a visual analytics system designed for exploring concepts in LLMs. ConceptViz implements a novel dentification => Interpretation => Validation pipeline, enabling users to query SAEs using concepts of interest, interactively explore concept-to-feature alignments, and validate the correspondences through model behavior verification. We demonstrate the effectiveness of ConceptViz through two usage scenarios and a user study. Our results show that ConceptViz enhances interpretability research by streamlining the discovery and validation of meaningful concept representations in LLMs, ultimately aiding researchers in building more accurate mental models of LLM features. Our code and user guide are publicly available at https://github.com/Happy-Hippo209/ConceptViz.
Chinese: ConceptViz是一个可视化分析系统,通过创新的识别-解释-验证流程,弥合了稀疏自编码器特征与人类可理解概念之间的鸿沟,帮助研究人员有效探索和验证大语言模型中的概念表征。
English: ConceptViz is a visual analytics system that bridges the gap between sparse autoencoder features and human-understandable concepts in large language models, enabling efficient discovery and validation of interpretable representations through an interactive pipeline.

Authors:Nithin Somasekharan, Ling Yue, Yadi Cao, Weichao Li, Patrick Emami, Pochinapeddi Sai Bhargav, Anurag Acharya, Xingyu Xie, Shaowu Pan
Title: CFD-LLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
Abstract:
Large Language Models (LLMs) have demonstrated strong performance across general NLP tasks, but their utility in automating numerical experiments of complex physical system -- a critical and labor-intensive component -- remains underexplored. As the major workhorse of computational science over the past decades, Computational Fluid Dynamics (CFD) offers a uniquely challenging testbed for evaluating the scientific capabilities of LLMs. We introduce CFDLLMBench, a benchmark suite comprising three complementary components -- CFDQuery, CFDCodeBench, and FoamBench -- designed to holistically evaluate LLM performance across three key competencies: graduate-level CFD knowledge, numerical and physical reasoning of CFD, and context-dependent implementation of CFD workflows. Grounded in real-world CFD practices, our benchmark combines a detailed task taxonomy with a rigorous evaluation framework to deliver reproducible results and quantify LLM performance across code executability, solution accuracy, and numerical convergence behavior. CFDLLMBench establishes a solid foundation for the development and evaluation of LLM-driven automation of numerical experiments for complex physical systems. Code and data are available at https://github.com/NREL-Theseus/cfdllmbench/.
中文: 大型语言模型在自动化复杂物理系统实验方面具有潜力,CFDLLMBench基准通过评估其在计算流体动力学的知识、推理和实施能力,为此提供了系统性验证基础。
English: Large Language Models (LLMs) show potential in automating complex physical system experiments, as demonstrated by the CFDLLMBench benchmark designed to evaluate their capabilities in computational fluid dynamics knowledge, reasoning, and implementation.

Authors:Adithya Bhaskar, Xi Ye, Danqi Chen
Title: Language Models that Think, Chat Better
Abstract:
Reinforcement learning with verifiable rewards (RLVR) improves language model reasoning by using rule-based rewards in verifiable domains such as mathematics and code. However, RLVR leads to limited generalization for open-ended tasks -- such as writing outline essays or making meal plans -- where humans reason routinely. This paper shows that the RLVR paradigm is effective beyond verifiable domains, and introduces **RL** with **M**odel-rewarded **T**hinking (**RLMT**) for general-purpose chat capabilities. Using diverse real-world prompts, RLMT requires LMs to generate long CoT reasoning before response, and optimizes them with online RL against a preference-based reward model used in RLHF. Across 40 training runs on Llama-3.1-8B and Qwen-2.5-7B (both base and instruct) and multiple optimization algorithms (DPO, PPO, and GRPO), RLMT consistently outperforms standard RLHF pipelines. This includes substantial gains of 3-7 points on three chat benchmarks (AlpacaEval2, WildBench, and ArenaHardV2), along with 1-3 point improvements on other tasks like creative writing and general knowledge. Our best 8B model surpasses GPT-4o in chat and creative writing and rivals Claude-3.7-Sonnet (Thinking). RLMT can also be applied directly to base models without an SFT stage, akin to R1-Zero training. Remarkably, with only 7K prompts, Llama-3.1-8B base trained with our RLMT recipe outperforms Llama-3.1-8B-Instruct post-trained with a complex multi-staged pipeline with 25M+ examples. We close with qualitative and quantitative analyses of how trained models plan their responses. Our results rethink the post-training pipeline and call upon future work to understand and employ thinking more broadly.
中文: RLMT通过在线强化学习利用奖励模型优化语言模型的思维链推理,在多种聊天基准测试中显著超越RLHF,甚至能与GPT-4o和Claude-3.7-Sonnet等先进模型相媲美。
English: RLMT enhances language models for general chat by optimizing them with online reinforcement learning using reward models on reasoning chains, outperforming RLHF across benchmarks and even rivaling advanced models like GPT-4o and Claude-3.7-Sonnet.

Authors:Sara Fridovich-Keil, Mert Pilanci
Title: A Recovery Guarantee for Sparse Neural Networks
Abstract:
We prove the first guarantees of sparse recovery for ReLU neural networks, where the sparse network weights constitute the signal to be recovered. Specifically, we study structural properties of the sparse network weights for two-layer, scalar-output networks under which a simple iterative hard thresholding algorithm recovers these weights exactly, using memory that grows linearly in the number of nonzero weights. We validate this theoretical result with simple experiments on recovery of sparse planted MLPs, MNIST classification, and implicit neural representations. Experimentally, we find performance that is competitive with, and often exceeds, a high-performing but memory-inefficient baseline based on iterative magnitude pruning.
中文: 本研究首次为ReLU神经网络的稀疏恢复提供了理论保证,证明在特定结构条件下,采用内存高效的迭代硬阈值算法能精确恢复稀疏网络权重,实验结果表明其性能优于内存密集型基线方法。
English: This study provides the first guarantees for sparse recovery in ReLU neural networks, demonstrating that a memory-efficient iterative hard thresholding algorithm can exactly recover sparse network weights under specific structural conditions, with experimental results outperforming memory-intensive baselines.

Authors:Bishal Adhikari, Jiajia Li, Eric S. Michel, Jacob Dykes, Te-Ming Paul Tseng, Mary Love Tagert, Dong Chen
Title: A Comprehensive Evaluation of YOLO-based Deer Detection Performance on Edge Devices
Abstract:
The escalating economic losses in agriculture due to deer intrusion, estimated to be in the hundreds of millions of dollars annually in the U.S., highlight the inadequacy of traditional mitigation strategies since these methods are often labor-intensive, costly, and ineffective for modern farming systems. To overcome this, there is a critical need for intelligent, autonomous solutions which require accurate and efficient deer detection. But the progress in this field is impeded by a significant gap in the literature, mainly the lack of a domain-specific, practical dataset and limited study on the on-field deployability of deer detection systems. Addressing this gap, this study presents a comprehensive evaluation of state-of-the-art deep learning models for deer detection in challenging real-world scenarios. The contributions of this work are threefold. First, we introduce a curated, publicly available dataset of 3,095 annotated images with bounding-box annotations of deer, derived from the Idaho Cameratraps project. Second, we provide an extensive comparative analysis of 12 model variants across four recent YOLO architectures(v8, v9, v10, and v11). Finally, we benchmarked performance on a high-end NVIDIA RTX 5090 GPU and evaluated on two representative edge computing platforms: Raspberry Pi 5 and NVIDIA Jetson AGX Xavier. Results show that the real-time detection is not feasible in Raspberry Pi without hardware-specific model optimization, while NVIDIA Jetson provides greater than 30 FPS with GPU-accelerated inference on 's' and 'n' series models. This study also reveals that smaller, architecturally advanced models such as YOLOv11n, YOLOv8s, and YOLOv9s offer the optimal balance of high accuracy (AP@.5 > 0.85) and computational efficiency (FPS > 30). To support further research, both the source code and datasets are publicly available at https://github.com/WinnerBishal/track-the-deer.
中文: 本研究通过提供标注数据集并评估先进深度学习模型,解决了传统方法在应对农业中鹿群入侵方面的不足,发现优化后的模型如YOLOv11n在边缘设备上能实现高精度和实时检测性能。
English: This study addresses the inadequacy of traditional methods for mitigating deer intrusion in agriculture by introducing a curated dataset and evaluating advanced deep learning models, finding that optimized models like YOLOv11n achieve high accuracy and real-time performance on edge devices such as NVIDIA Jetson.

Authors:Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, Dahua Lin
Title: SIM-CoT: Supervised Implicit Chain-of-Thought
Abstract:
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models (LLMs), but a persistent performance gap has limited their adoption. We identify a core latent instability issue when scaling the computational budget of implicit CoT: as the number of reasoning tokens increases, training often becomes unstable and collapses. Our analysis shows that this instability arises from latent representations becoming homogeneous and losing semantic diversity, caused by insufficient step-level supervision in current implicit CoT methods. To address this, we propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space. SIM-CoT employs an auxiliary decoder during training to align each implicit token with its corresponding explicit reasoning step, ensuring latent states capture distinct and meaningful information. The auxiliary decoder is removed at inference, preserving the efficiency of implicit CoT with no added overhead. It also provides interpretability by projecting each latent token onto an explicit reasoning vocabulary, enabling per-step visualization and diagnosis. SIM-CoT significantly improves both in-domain accuracy and out-of-domain stability of implicit CoT methods, boosting Coconut by +8.2\% on GPT-2 and CODI by +3.0\% on LLaMA-3.1 8B. It further surpasses the explicit CoT baseline on GPT-2 by 2.1\% with 2.3$\times$ greater token efficiency, while closing the performance gap on larger models like LLaMA-3.1 8B. Code: https://github.com/InternLM/SIM-CoT
中文摘要:隐式思维链方法在扩展推理令牌时存在性能不稳定问题,SIM-CoT通过引入步骤级监督来稳定训练过程,在保持推理效率的同时显著提升了准确性和稳定性。
English Summary: Implicit Chain-of-Thought methods face performance instability when scaling reasoning tokens, which SIM-CoT addresses through step-level supervision to stabilize training and enhance both accuracy and efficiency without inference overhead.

Authors:Xichen Xu, Yanshu Wang, Jinbao Wang, Xiaoning Lei, Guoyang Xie, Guannan Jiang, Zhichao Lu
Title: FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis
Abstract:
Industrial anomaly segmentation relies heavily on pixel-level annotations, yet real-world anomalies are often scarce, diverse, and costly to label. Segmentation-oriented industrial anomaly synthesis (SIAS) has emerged as a promising alternative; however, existing methods struggle to balance sampling efficiency and generation quality. Moreover, most approaches treat all spatial regions uniformly, overlooking the distinct statistical differences between anomaly and background areas. This uniform treatment hinders the synthesis of controllable, structure-specific anomalies tailored for segmentation tasks. In this paper, we propose FAST, a foreground-aware diffusion framework featuring two novel modules: the Anomaly-Informed Accelerated Sampling (AIAS) and the Foreground-Aware Reconstruction Module (FARM). AIAS is a training-free sampling algorithm specifically designed for segmentation-oriented industrial anomaly synthesis, which accelerates the reverse process through coarse-to-fine aggregation and enables the synthesis of state-of-the-art segmentation-oriented anomalies in as few as 10 steps. Meanwhile, FARM adaptively adjusts the anomaly-aware noise within the masked foreground regions at each sampling step, preserving localized anomaly signals throughout the denoising trajectory. Extensive experiments on multiple industrial benchmarks demonstrate that FAST consistently outperforms existing anomaly synthesis methods in downstream segmentation tasks. We release the code at: https://github.com/Chhro123/fast-foreground-aware-anomaly-synthesis.
中文:FAST框架通过AIAS和FARM两个创新模块,在加速采样的同时保持局部异常信号,能高效生成面向分割的高质量工业异常样本,在下游任务中显著优于现有方法。
English: The proposed FAST framework introduces two novel modules, AIAS and FARM, to efficiently generate high-quality, segmentation-oriented industrial anomalies by accelerating sampling and preserving localized anomaly signals, significantly outperforming existing methods in downstream tasks.

Authors:Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson
Title: When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity
Abstract:
LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth based benchmarks. We argue that without tight objectives and verifiable constructions, benchmark rankings can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We released our code and dataset at https://github.com/penfever/judgment-to-noise
中文:LLM评判的基准因设计缺陷常产生不可靠排名,但新诊断工具揭示了高解释方差和排名不确定性,呼吁采用范围更明确且关注可靠性的设计。
English: LLM-judged benchmarks often produce unreliable rankings due to design flaws, but new diagnostic tools reveal high unexplained variance and ranking uncertainty, urging better-scoped and reliability-aware designs.

Authors:Dayu Tan, Jing Chen, Xiaoping Zhou, Yansen Su, Chunhou Zheng
Title: PGCLODA: Prompt-Guided Graph Contrastive Learning for Oligopeptide-Infectious Disease Association Prediction
Abstract:
Infectious diseases continue to pose a serious threat to public health, underscoring the urgent need for effective computational approaches to screen novel anti-infective agents. Oligopeptides have emerged as promising candidates in antimicrobial research due to their structural simplicity, high bioavailability, and low susceptibility to resistance. Despite their potential, computational models specifically designed to predict associations between oligopeptides and infectious diseases remain scarce. This study introduces a prompt-guided graph-based contrastive learning framework (PGCLODA) to uncover potential associations. A tripartite graph is constructed with oligopeptides, microbes, and diseases as nodes, incorporating both structural and semantic information. To preserve critical regions during contrastive learning, a prompt-guided graph augmentation strategy is employed to generate meaningful paired views. A dual encoder architecture, integrating Graph Convolutional Network (GCN) and Transformer, is used to jointly capture local and global features. The fused embeddings are subsequently input into a multilayer perceptron (MLP) classifier for final prediction. Experimental results on a benchmark dataset indicate that PGCLODA consistently outperforms state-of-the-art models in AUROC, AUPRC, and accuracy. Ablation and hyperparameter studies confirm the contribution of each module. Case studies further validate the generalization ability of PGCLODA and its potential to uncover novel, biologically relevant associations. These findings offer valuable insights for mechanism-driven discovery and oligopeptide-based drug development. The source code of PGCLODA is available online at https://github.com/jjnlcode/PGCLODA.
中文: 本研究提出的PGCLODA框架通过提示引导的图对比学习方法,能有效预测寡肽与传染病的关联关系,在预测性能上显著优于现有模型,为抗感染药物研发提供了重要参考。
English: This study introduces PGCLODA, a novel prompt-guided graph contrastive learning framework that effectively predicts associations between oligopeptides and infectious diseases, demonstrating superior performance over existing models and offering valuable insights for antimicrobial drug development.

Authors:Dayu Tan, Zhenpeng Xu, Yansen Su, Xin Peng, Chunhou Zheng, Weimin Zhong
Title: HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy
Abstract:
Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.
中文摘要:本文提出的HiPerformer模型通过模块化分层编码器和专门设计的融合模块,实现了多尺度特征的动态整合,有效克服传统混合架构的局限性,在多个医学图像分割数据集上展现出卓越性能。
English Summary: The proposed HiPerformer model introduces a modular hierarchical encoder and specialized fusion modules to dynamically integrate multi-scale features, effectively overcoming limitations of conventional hybrid architectures and achieving superior medical image segmentation performance across multiple datasets.

Authors:Dayu Tan, Zhenpeng Xu, Yansen Su, Xin Peng, Chunhou Zheng, Weimin Zhong
Title: HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy
Abstract:
Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.
中文摘要:本文提出的HiPerformer模型通过模块化分层编码器和专门设计的融合模块,实现了多尺度特征的动态整合,有效克服传统混合架构的局限性,在多个医学图像分割数据集上展现出卓越性能。
English Summary: The proposed HiPerformer model introduces a modular hierarchical encoder and specialized fusion modules to dynamically integrate multi-scale features, effectively overcoming limitations of conventional hybrid architectures and achieving superior medical image segmentation performance across multiple datasets.

Authors:Kwang-Hyun Uhm, Hyunjun Cho, Sung-Hoo Hong, Seung-Won Jung
Title: An Anisotropic Cross-View Texture Transfer with Multi-Reference Non-Local Attention for CT Slice Interpolation
Abstract:
Computed tomography (CT) is one of the most widely used non-invasive imaging modalities for medical diagnosis. In clinical practice, CT images are usually acquired with large slice thicknesses due to the high cost of memory storage and operation time, resulting in an anisotropic CT volume with much lower inter-slice resolution than in-plane resolution. Since such inconsistent resolution may lead to difficulties in disease diagnosis, deep learning-based volumetric super-resolution methods have been developed to improve inter-slice resolution. Most existing methods conduct single-image super-resolution on the through-plane or synthesize intermediate slices from adjacent slices; however, the anisotropic characteristic of 3D CT volume has not been well explored. In this paper, we propose a novel cross-view texture transfer approach for CT slice interpolation by fully utilizing the anisotropic nature of 3D CT volume. Specifically, we design a unique framework that takes high-resolution in-plane texture details as a reference and transfers them to low-resolution through-plane images. To this end, we introduce a multi-reference non-local attention module that extracts meaningful features for reconstructing through-plane high-frequency details from multiple in-plane images. Through extensive experiments, we demonstrate that our method performs significantly better in CT slice interpolation than existing competing methods on public CT datasets including a real-paired benchmark, verifying the effectiveness of the proposed framework. The source code of this work is available at https://github.com/khuhm/ACVTT.
中文: 本文提出了一种新颖的跨视图纹理迁移方法,通过利用3D CT图像的各向异性特征,将高分辨率平面内纹理细节迁移至低分辨率平面间图像,在CT切片插值任务中显著优于现有方法。
English: This paper introduces a novel cross-view texture transfer method that leverages the anisotropic nature of 3D CT volumes to enhance inter-slice resolution by transferring high-resolution in-plane textures to through-plane images, demonstrating superior performance in CT slice interpolation compared to existing methods.

Authors:Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
Title: ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Abstract:
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance towards texture. Code is available at https://github.com/tomburgert/feature-reliance.
中文: 该研究挑战了卷积神经网络天生偏向纹理的假设,通过领域无关框架证明其主要依赖局部形状特征,且在计算机视觉、医学影像和遥感领域表现出不同的特征依赖模式。
English: The study challenges the notion that CNNs are inherently texture-biased, demonstrating through a domain-agnostic framework that they primarily rely on local shape features, with reliance patterns varying across computer vision, medical imaging, and remote sensing domains.

Authors:Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
Title: ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Abstract:
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
中文: 该研究挑战了卷积神经网络天生偏向纹理的假设,通过领域无关框架证明其主要依赖局部形状特征,且在计算机视觉、医学影像和遥感领域表现出不同的特征依赖模式。
English: The study challenges the notion that CNNs are inherently texture-biased, demonstrating through a domain-agnostic framework that they primarily rely on local shape features, with reliance patterns varying across computer vision, medical imaging, and remote sensing domains.

Authors:Deokjae Lee, Hyun Oh Song
Title: Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Abstract:
We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.
Chinese: 本研究提出了Q-Palette,一套用于大语言模型仅权重量化的分数位量化器集合,在资源约束下优化量化性能与推理速度,并支持混合方案框架。
English: This research introduces Q-Palette, a collection of fractional-bit quantizers for weight-only post-training quantization of large language models, which optimizes quantization performance and inference speed while enabling a mixed-scheme framework under resource constraints.

Authors:Hailay Kidu Teklehaymanot, Gebrearegawi Gidey, Wolfgang Nejdl
Title: Low-Resource English-Tigrinya MT: Leveraging Multilingual Models, Custom Tokenizers, and Clean Evaluation Benchmarks
Abstract:
Despite advances in Neural Machine Translation (NMT), low-resource languages like Tigrinya remain underserved due to persistent challenges, including limited corpora, inadequate tokenization strategies, and the lack of standardized evaluation benchmarks. This paper investigates transfer learning techniques using multilingual pretrained models to enhance translation quality for morphologically rich, low-resource languages. We propose a refined approach that integrates language-specific tokenization, informed embedding initialization, and domain-adaptive fine-tuning. To enable rigorous assessment, we construct a high-quality, human-aligned English-Tigrinya evaluation dataset covering diverse domains. Experimental results demonstrate that transfer learning with a custom tokenizer substantially outperforms zero-shot baselines, with gains validated by BLEU, chrF, and qualitative human evaluation. Bonferroni correction is applied to ensure statistical significance across configurations. Error analysis reveals key limitations and informs targeted refinements. This study underscores the importance of linguistically aware modeling and reproducible benchmarks in bridging the performance gap for underrepresented languages. Resources are available at https://github.com/hailaykidu/MachineT_TigEng and https://huggingface.co/Hailay/MachineT_TigEng
中文:本研究通过采用定制化分词和领域自适应的迁移学习方法,提升了提格里尼亚语的神经机器翻译质量,并利用新建评估数据集验证了其显著性能提升。
English: This study improves Tigrinya neural machine translation through transfer learning with customized tokenization and domain adaptation, validated by a new evaluation dataset showing significant performance gains.

Authors:Parker Glenn, Alfy Samuel, Daben Liu
Title: Play by the Type Rules: Inferring Constraints for LLM Functions in Declarative Programs
Abstract:
Integrating LLM powered operators in declarative query languages allows for the combination of cheap and interpretable functions with powerful, generalizable language model reasoning. However, in order to benefit from the optimized execution of a database query language like SQL, generated outputs must align with the rules enforced by both type checkers and database contents. Current approaches address this challenge with orchestrations consisting of many LLM-based post-processing calls to ensure alignment between generated outputs and database values, introducing performance bottlenecks. We perform a study on the ability of various sized open-source language models to both parse and execute functions within a query language based on SQL, showing that small language models can excel as function executors over hybrid data sources. Then, we propose an efficient solution to enforce the well-typedness of LLM functions, demonstrating 7% accuracy improvement on a multi-hop question answering dataset with 53% improvement in latency over comparable solutions. We make our implementation available at https://github.com/parkervg/blendsql
中文摘要:研究表明小型语言模型能有效执行类SQL查询语言中的函数,并提出一种高效解决方案,相比现有方法在准确率上提升7%,延迟降低53%。
English Summary: This study demonstrates that small language models can effectively execute functions within SQL-like query languages, proposing an efficient solution that improves accuracy by 7% and reduces latency by 53% compared to existing methods.

Authors:Mahmoud Khater, Mona Strauss, Philipp von Olshausen, Alexander Reiterer
Title: PU-Gaussian: Point Cloud Upsampling using 3D Gaussian Representation
Abstract:
Point clouds produced by 3D sensors are often sparse and noisy, posing challenges for tasks requiring dense and high-fidelity 3D representations. Prior work has explored both implicit feature-based upsampling and distance-function learning to address this, but often at the expense of geometric interpretability or robustness to input sparsity. To overcome these limitations, we propose PU-Gaussian, a novel upsampling network that models the local neighborhood around each point using anisotropic 3D Gaussian distributions. These Gaussians capture the underlying geometric structure, allowing us to perform upsampling explicitly in the local geometric domain by direct point sampling. The sampling process generates a dense, but coarse, point cloud. A subsequent refinement network adjusts the coarse output to produce a more uniform distribution and sharper edges. We perform extensive testing on the PU1K and PUGAN datasets, demonstrating that PU-Gaussian achieves state-of-the-art performance. We make code and model weights publicly available at https://github.com/mvg-inatech/PU-Gaussian.git.
中文摘要:PU-Gaussian是一种创新的上采样网络,通过各向异性3D高斯分布建模局部几何结构,实现显式上采样和精细化处理,在基准数据集上取得了最优性能,能生成稠密且高保真的点云。
English Summary: PU-Gaussian is a novel upsampling network that uses anisotropic 3D Gaussian distributions to model local geometry, enabling explicit upsampling and refinement for dense, high-fidelity point clouds, achieving state-of-the-art results on benchmark datasets.

Authors:Philipp Erler, Lukas Herzberger, Michael Wimmer, Markus Schütz
Title: LidarScout: Direct Out-of-Core Rendering of Massive Point Clouds
Abstract:
Large-scale terrain scans are the basis for many important tasks, such as topographic mapping, forestry, agriculture, and infrastructure planning. The resulting point cloud data sets are so massive in size that even basic tasks like viewing take hours to days of pre-processing in order to create level-of-detail structures that allow inspecting the data set in their entirety in real time. In this paper, we propose a method that is capable of instantly visualizing massive country-sized scans with hundreds of billions of points. Upon opening the data set, we first load a sparse subsample of points and initialize an overview of the entire point cloud, immediately followed by a surface reconstruction process to generate higher-quality, hole-free heightmaps. As users start navigating towards a region of interest, we continue to prioritize the heightmap construction process to the user's viewpoint. Once a user zooms in closely, we load the full-resolution point cloud data for that region and update the corresponding height map textures with the full-resolution data. As users navigate elsewhere, full-resolution point data that is no longer needed is unloaded, but the updated heightmap textures are retained as a form of medium level of detail. Overall, our method constitutes a form of direct out-of-core rendering for massive point cloud data sets (terabytes, compressed) that requires no preprocessing and no additional disk space. Source code, executable, pre-trained model, and dataset are available at: https://github.com/cg-tuwien/lidarscout
中文: 本文提出一种无需预处理即可实时可视化海量点云的方法,通过根据用户导航动态加载和重建高度图,实现即时浏览。
English: This paper presents a real-time visualization method for massive country-sized point cloud scans that enables instant viewing without preprocessing by dynamically loading and reconstructing heightmaps based on user navigation.

Authors:Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wud, Zichen Wang
Title: Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
Abstract:
Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at https://github.com/ChaojunNie/RLAG.
中文摘要:提出的强化学习增强生成(RLAG)方法通过奖励引导的迭代优化,有效克服了现有技术在领域知识整合中的不足,在多个专业领域显著提升了模型的准确性和解释合理性。
English Summary: The proposed Reinforcement Learning from Augmented Generation (RLAG) method overcomes limitations of existing approaches by iteratively optimizing models through reward-guided sampling, significantly enhancing domain-specific knowledge integration and reasoning across multiple specialized fields.

Authors:Chaojun Nie, Jun Zhou, Guanxiang Wang, Shisong Wu, Zichen Wang
Title: Embedding Domain Knowledge for Large Language Models via Reinforcement Learning from Augmented Generation
Abstract:
Large language models (LLMs) often exhibit limited performance on domain-specific tasks due to the natural disproportionate representation of specialized information in their training data and the static nature of these datasets. Knowledge scarcity and temporal lag create knowledge gaps for domain applications. While post-training on domain datasets can embed knowledge into models, existing approaches have some limitations. Continual Pre-Training (CPT) treats all tokens in domain documents with equal importance, failing to prioritize critical knowledge points, while supervised fine-tuning (SFT) with question-answer pairs struggles to develop the coherent knowledge structures necessary for complex reasoning tasks. To address these challenges, we propose Reinforcement Learning from Augmented Generation (RLAG). Our approach iteratively cycles between sampling generations and optimizing the model through calculated rewards, effectively embedding critical and contextually coherent domain knowledge. We select generated outputs with the highest log probabilities as the sampling result, then compute three tailored reward metrics to guide the optimization process. To comprehensively evaluate domain expertise, we assess answer accuracy and the rationality of explanations generated for correctly answered questions. Experimental results across medical, legal, astronomy, and current events datasets demonstrate that our proposed method significantly outperforms baseline approaches. Our code and data are open sourced at https://github.com/ChaojunNie/RLAG.
中文摘要:提出的强化学习增强生成(RLAG)方法通过奖励引导的迭代优化,有效克服了现有技术在领域知识整合中的不足,在多个专业领域显著提升了模型的准确性和解释合理性。
English Summary: The proposed Reinforcement Learning from Augmented Generation (RLAG) method overcomes limitations of existing approaches by iteratively optimizing models through reward-guided sampling, significantly enhancing domain-specific knowledge integration and reasoning across multiple specialized fields.

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Title: U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT
Abstract:
Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.872 and a DSC of 0.969 on the validation dataset, demonstrating the superior performance of our approach. The code is available at https://github.com/zhiqin1998/UMamba2.
中文: 本文提出U-Mamba2-SSL半监督学习框架,通过多阶段训练提升CBCT分割精度,并在STSR 2025挑战赛中荣获第一名。
English: This paper introduces U-Mamba2-SSL, a semi-supervised learning framework that enhances CBCT segmentation accuracy through multi-stage training and achieved top performance in the STSR 2025 challenge.

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Title: U-Mamba2-SSL for Semi-Supervised Tooth and Pulp Segmentation in CBCT
Abstract:
Accurate segmentation of teeth and pulp in Cone-Beam Computed Tomography (CBCT) is vital for clinical applications like treatment planning and diagnosis. However, this process requires extensive expertise and is exceptionally time-consuming, highlighting the critical need for automated algorithms that can effectively utilize unlabeled data. In this paper, we propose U-Mamba2-SSL, a novel semi-supervised learning framework that builds on the U-Mamba2 model and employs a multi-stage training strategy. The framework first pre-trains U-Mamba2 in a self-supervised manner using a disruptive autoencoder. It then leverages unlabeled data through consistency regularization, where we introduce input and feature perturbations to ensure stable model outputs. Finally, a pseudo-labeling strategy is implemented with a reduced loss weighting to minimize the impact of potential errors. U-Mamba2-SSL achieved an average score of 0.789 and a DSC of 0.917 on the hidden test set, achieving first place in Task 1 of the STSR 2025 challenge. The code is available at https://github.com/zhiqin1998/UMamba2.
中文: 本文提出U-Mamba2-SSL半监督学习框架,通过多阶段训练提升CBCT分割精度,并在STSR 2025挑战赛中荣获第一名。
English: This paper introduces U-Mamba2-SSL, a semi-supervised learning framework that enhances CBCT segmentation accuracy through multi-stage training and achieved top performance in the STSR 2025 challenge.

Authors:Min Cen, Zhenfeng Zhuang, Yuzhe Zhang, Min Zeng, Baptiste Magnier, Lequan Yu, Hong Zhang, Liansheng Wang
Title: C$^2$MIL: Synchronizing Semantic and Topological Causalities in Multiple Instance Learning for Robust and Interpretable Survival Analysis
Abstract:
Graph-based Multiple Instance Learning (MIL) is widely used in survival analysis with Hematoxylin and Eosin (H\&E)-stained whole slide images (WSIs) due to its ability to capture topological information. However, variations in staining and scanning can introduce semantic bias, while topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To tackle this, we introduce a dual structural causal model as the theoretical foundation and propose a novel and interpretable dual causal graph-based MIL model, C$^2$MIL. C$^2$MIL incorporates a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy combining disentangling supervision and contrastive learning enables simultaneous refinement of both semantic and topological causalities. Experiments demonstrate that C$^2$MIL consistently improves generalization and interpretability over existing methods and can serve as a causal enhancement for diverse MIL baselines. The code is available at https://github.com/mimic0127/C2MIL.
中文:提出的C$^2$MIL模型通过双结构因果框架整合语义因果干预与拓扑因果发现,有效解决了基于图的多示例学习中染色差异和无关拓扑噪声问题,显著提升了泛化能力和可解释性。
English: The proposed C$^2$MIL model addresses staining variations and irrelevant topological noise in graph-based MIL for survival analysis by integrating semantic causal intervention and topological causal discovery through a dual structural causal framework, significantly enhancing both generalization and interpretability.

Authors:Zizheng Yang, Hu Yu, Bing Li, Jinghao Zhang, Jie Huang, Feng Zhao
Title: Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing
Abstract:
Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI$^2$D avoids re-training diffusion models and iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at https://github.com/aaaasan111/difflid.
中文摘要:提出的DiffLI²D网络利用预训练扩散模型的语义潜在空间特性,无需重新训练或迭代采样即可实现高效图像去雾,在多个数据集上达到了最优性能。
English Summary: The proposed DiffLI²D network leverages the semantic latent space of pre-trained diffusion models to enable efficient image dehazing without retraining or iterative sampling, achieving state-of-the-art performance across multiple datasets.

Authors:Manahil Raza, Ayesha Azam, Talha Qaiser, Nasir Rajpoot
Title: PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction
Abstract:
Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: https://github.com/manahilr/PS3.
中文摘要:本研究提出PS3模型,通过原型表示整合病理报告、全切片图像和转录组数据,利用基于Transformer的融合方法提升癌症生存预测性能,在多个TCGA数据集上验证了其优于现有方法的有效性。
English Summary: This study introduces PS3, a transformer-based model that integrates pathology reports, whole slide images, and transcriptomic data through prototype representations to improve cancer survival prediction, demonstrating superior performance over existing methods across multiple TCGA datasets.

Authors:Nico Schulthess, Ender Konukoglu
Title: Anomaly Detection by Clustering DINO Embeddings using a Dirichlet Process Mixture
Abstract:
In this work, we leverage informative embeddings from foundational models for unsupervised anomaly detection in medical imaging. For small datasets, a memory-bank of normative features can directly be used for anomaly detection which has been demonstrated recently. However, this is unsuitable for large medical datasets as the computational burden increases substantially. Therefore, we propose to model the distribution of normative DINOv2 embeddings with a Dirichlet Process Mixture model (DPMM), a non-parametric mixture model that automatically adjusts the number of mixture components to the data at hand. Rather than using a memory bank, we use the similarity between the component centers and the embeddings as anomaly score function to create a coarse anomaly segmentation mask. Our experiments show that through DPMM embeddings of DINOv2, despite being trained on natural images, achieve very competitive anomaly detection performance on medical imaging benchmarks and can do this while at least halving the computation time at inference. Our analysis further indicates that normalized DINOv2 embeddings are generally more aligned with anatomical structures than unnormalized features, even in the presence of anomalies, making them great representations for anomaly detection. The code is available at https://github.com/NicoSchulthess/anomalydino-dpmm.
中文: 本研究提出一种无监督医学影像异常检测方法,通过狄利克雷过程混合模型对DINOv2特征进行建模,在降低计算成本的同时实现了优越的检测性能,并将推理时间至少缩短一半。
English: This study introduces an unsupervised anomaly detection method for medical imaging by modeling DINOv2 embeddings with a Dirichlet Process Mixture model, which reduces computational costs while achieving competitive performance and faster inference times.

Authors:Sepehr Maleki, Negar Pourmoazemi
Title: Pi-Transformer: A Physics-informed Attention Mechanism for Time Series Anomaly Detection
Abstract:
Anomalies in multivariate time series often arise from temporal context and cross-channel coordination rather than isolated outliers. We present Pi-Transformer, a physics-informed transformer with two attention pathways: a data-driven series attention and a smoothly evolving prior attention that encodes temporal invariants such as scale-related self-similarity and phase synchrony. The prior acts as a stable reference that calibrates reconstruction error. During training, we pair a reconstruction objective with a divergence term that encourages agreement between the two attentions while keeping them meaningfully distinct; the prior is regularised to evolve smoothly and is lightly distilled towards dataset-level statistics. At inference, the model combines an alignment-weighted reconstruction signal (Energy) with a mismatch signal that highlights timing and phase disruptions, and fuses them into a single score for detection. Across five benchmarks (SMD, MSL, SMAP, SWaT, and PSM), Pi-Transformer achieves state-of-the-art or highly competitive F1, with particular strength on timing and phase-breaking anomalies. Case analyses show complementary behaviour of the two streams and interpretable detections around regime changes. Embedding physics-informed priors into attention yields a calibrated and robust approach to anomaly detection in complex multivariate systems. Code is publicly available at this GitHub repository\footnote{https://github.com/sepehr-m/Pi-Transformer}.
中文摘要:Pi-Transformer提出了一种双注意力变换器,通过结合数据驱动分析和物理启发的时序不变性,利用校准重构与失配信号检测异常,在多个基准测试中实现了领先性能。
English Summary: Pi-Transformer introduces a dual-attention transformer that integrates data-driven analysis with physics-informed temporal invariants to detect anomalies through calibrated reconstruction and mismatch signals, achieving state-of-the-art performance on multiple benchmarks.

Authors:Haolin Li, Tianjie Dai, Zhe Chen, Siyuan Du, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Title: RAD: Towards Trustworthy Retrieval-Augmented Multi-modal Clinical Diagnosis
Abstract:
Clinical diagnosis is a highly specialized discipline requiring both domain expertise and strict adherence to rigorous guidelines. While current AI-driven medical research predominantly focuses on knowledge graphs or natural text pretraining paradigms to incorporate medical knowledge, these approaches primarily rely on implicitly encoded knowledge within model parameters, neglecting task-specific knowledge required by diverse downstream tasks. To address this limitation, we propose Retrieval-Augmented Diagnosis (RAD), a novel framework that explicitly injects external knowledge into multimodal models directly on downstream tasks. Specifically, RAD operates through three key mechanisms: retrieval and refinement of disease-centered knowledge from multiple medical sources, a guideline-enhanced contrastive loss that constrains the latent distance between multi-modal features and guideline knowledge, and the dual transformer decoder that employs guidelines as queries to steer cross-modal fusion, aligning the models with clinical diagnostic workflows from guideline acquisition to feature extraction and decision-making. Moreover, recognizing the lack of quantitative evaluation of interpretability for multimodal diagnostic models, we introduce a set of criteria to assess the interpretability from both image and text perspectives. Extensive evaluations across four datasets with different anatomies demonstrate RAD's generalizability, achieving state-of-the-art performance. Furthermore, RAD enables the model to concentrate more precisely on abnormal regions and critical indicators, ensuring evidence-based, trustworthy diagnosis. Our code is available at https://github.com/tdlhl/RAD.
中文摘要:提出的检索增强诊断(RAD)框架通过检索机制和专用解码器显式整合外部医学知识,增强了多模态诊断模型的性能,在多个临床数据集中实现了卓越的诊断准确性和可解释性。
English Summary: The proposed Retrieval-Augmented Diagnosis (RAD) framework enhances multimodal diagnostic models by explicitly integrating external medical knowledge through retrieval mechanisms and specialized decoders, achieving superior performance and interpretability across diverse clinical datasets.

Authors:Albina Klepach, Egor E. Nuzhin, Alexey A. Tsukanov, Nikolay V. Brilliantov
Title: An effective control of large systems of active particles: An application to evacuation problem
Abstract:
Manipulation of large systems of active particles is a serious challenge across diverse domains, including crowd management, control of robotic swarms, and coordinated material transport. The development of advanced control strategies for complex scenarios is hindered, however, by the lack of scalability and robustness of the existing methods, in particular, due to the need of an individual control for each agent. One possible solution involves controlling a system through a leader or a group of leaders, which other agents tend to follow. Using such an approach we develop an effective control strategy for a leader, combining reinforcement learning (RL) with artificial forces acting on the system. To describe the guidance of active particles by a leader we introduce the generalized Vicsek model. This novel method is then applied to the problem of the effective evacuation by a robot-rescuer (leader) of large groups of people from hazardous places. We demonstrate, that while a straightforward application of RL yields suboptimal results, even for advanced architectures, our approach provides a robust and efficient evacuation strategy. The source code supporting this study is publicly available at: https://github.com/cinemere/evacuation.
中文摘要:本研究提出了一种结合强化学习与人工力的方法,通过领导者控制活性粒子系统,为大规模人群从危险区域疏散提供了稳健高效的策略。
English Summary: This study presents a reinforcement learning approach combined with artificial forces to control active particle systems via leaders, offering a robust and efficient strategy for evacuating large groups from hazardous areas.

Authors:Albina Klepach, Egor E. Nuzhin, Alexey A. Tsukanov, Nikolay V. Brilliantov
Title: An effective control of large systems of active particles: An application to evacuation problem
Abstract:
Manipulation of large systems of active particles is a serious challenge across diverse domains, including crowd management, control of robotic swarms, and coordinated material transport. The development of advanced control strategies for complex scenarios is hindered, however, by the lack of scalability and robustness of the existing methods, in particular, due to the need of an individual control for each agent. One possible solution involves controlling a system through a leader or a group of leaders, which other agents tend to follow. Using such an approach we develop an effective control strategy for a leader, combining reinforcement learning (RL) with artificial forces acting on the system. To describe the guidance of active particles by a leader we introduce the generalized Vicsek model. This novel method is then applied to the problem of the effective evacuation by a robot-rescuer (leader) of large groups of people from hazardous places. We demonstrate, that while a straightforward application of RL yields suboptimal results, even for advanced architectures, our approach provides a robust and efficient evacuation strategy. The source code supporting this study is publicly available at: https://github.com/cinemere/evacuation.
中文摘要:本研究提出了一种结合强化学习与人工力的方法,通过领导者控制活性粒子系统,为大规模人群从危险区域疏散提供了稳健高效的策略。
English Summary: This study presents a reinforcement learning approach combined with artificial forces to control active particle systems via leaders, offering a robust and efficient strategy for evacuating large groups from hazardous areas.

Authors:Feiyang Fu, Tongxian Guo, Zhaoqiang Liu
Title: Learnable Sampler Distillation for Discrete Diffusion Models
Abstract:
Discrete diffusion models (DDMs) have shown powerful generation ability for discrete data modalities like text and molecules. However, their practical application is hindered by inefficient sampling, requiring a large number of sampling steps. Accelerating DDMs by using larger step sizes typically introduces significant problems in generation quality, as it amplifies the impact of both the compounding decoding error due to factorized predictions and discretization error from numerical approximations, leading to a significant decrease in sampling quality. To address these challenges, we propose learnable sampler distillation (LSD), a novel approach to train fast and high-fidelity samplers for DDMs. LSD employs a distillation approach where a student sampler with a few steps learns to align its intermediate score trajectory with that of a high-quality teacher sampler with numerous steps. This alignment is achieved by optimizing learnable sampler coefficients that adaptively adjust sampling dynamics. Additionally, we further propose LSD+, which also learns time schedules that allocate steps non-uniformly. Experiments across text generation, image generation, and synthetic tasks demonstrate that our proposed approaches outperform existing samplers for DDMs, achieving substantially higher sampling quality with significantly fewer sampling steps. Our code is available at \href{https://github.com/feiyangfu/LSD}{https://github.com/feiyangfu/LSD}.
中文: 提出的可学习采样器蒸馏(LSD)方法通过训练高效学生采样器来匹配高质量教师采样器的轨迹,使离散扩散模型在文本生成等任务中能以更少采样步骤实现更优生成质量。
English: The proposed learnable sampler distillation (LSD) method trains efficient student samplers to match high-quality teacher trajectories, enabling discrete diffusion models to achieve superior generation quality with significantly fewer sampling steps across various tasks.

Authors:Sarmistha Das, R E Zera Marveen Lyngkhoi, Kirtan Jain, Vinayak Goyal, Sriparna Saha, Manish Gupta
Title: When Words Can't Capture It All: Towards Video-Based User Complaint Text Generation with Multimodal Video Complaint Dataset
Abstract:
While there exists a lot of work on explainable complaint mining, articulating user concerns through text or video remains a significant challenge, often leaving issues unresolved. Users frequently struggle to express their complaints clearly in text but can easily upload videos depicting product defects (e.g., vague text such as `worst product' paired with a 5-second video depicting a broken headphone with the right earcup). This paper formulates a new task in the field of complaint mining to aid the common users' need to write an expressive complaint, which is Complaint Description from Videos (CoD-V) (e.g., to help the above user articulate her complaint about the defective right earcup). To this end, we introduce ComVID, a video complaint dataset containing 1,175 complaint videos and the corresponding descriptions, also annotated with the emotional state of the complainer. Additionally, we present a new complaint retention (CR) evaluation metric that discriminates the proposed (CoD-V) task against standard video summary generation and description tasks. To strengthen this initiative, we introduce a multimodal Retrieval-Augmented Generation (RAG) embedded VideoLLaMA2-7b model, designed to generate complaints while accounting for the user's emotional state. We conduct a comprehensive evaluation of several Video Language Models on several tasks (pre-trained and fine-tuned versions) with a range of established evaluation metrics, including METEOR, perplexity, and the Coleman-Liau readability score, among others. Our study lays the foundation for a new research direction to provide a platform for users to express complaints through video. Dataset and resources are available at: https://github.com/sarmistha-D/CoD-V.
中文: 本文提出了基于视频的投诉描述(CoD-V)新任务,通过利用视频内容帮助用户更清晰地表达产品问题,并提供了ComVID数据集及融合多模态检索增强生成的模型作为支持。
English: This paper introduces Complaint Description from Videos (CoD-V), a novel task that leverages video content to help users articulate product complaints more effectively, supported by the ComVID dataset and a multimodal RAG-enhanced model.

Authors:Miren Samaniego, Igor Rodriguez, Elena Lazkano
Title: CapStARE: Capsule-based Spatiotemporal Architecture for Robust and Efficient Gaze Estimation
Abstract:
We introduce CapStARE, a capsule-based spatio-temporal architecture for gaze estimation that integrates a ConvNeXt backbone, capsule formation with attention routing, and dual GRU decoders specialized for slow and rapid gaze dynamics. This modular design enables efficient part-whole reasoning and disentangled temporal modeling, achieving state-of-the-art performance on ETH-XGaze (3.36) and MPIIFaceGaze (2.65) while maintaining real-time inference (< 10 ms). The model also generalizes well to unconstrained conditions in Gaze360 (9.06) and human-robot interaction scenarios in RT-GENE (4.76), outperforming or matching existing methods with fewer parameters and greater interpretability. These results demonstrate that CapStARE offers a practical and robust solution for real-time gaze estimation in interactive systems. The related code and results for this article can be found on: https://github.com/toukapy/capsStare
中文: CapStARE是一种基于胶囊的时空架构,在多个数据集上实现了最先进的视线估计性能,兼具实时效率和强大的泛化能力。
English: CapStARE is a capsule-based spatio-temporal architecture that achieves state-of-the-art gaze estimation performance with real-time efficiency and strong generalization across multiple datasets.

Authors:Renxiang Wang, Li Zhang
Title: Documentation Retrieval Improves Planning Language Generation
Abstract:
Certain strong LLMs have shown promise for zero-shot formal planning by generating planning languages like PDDL. Yet, performance of most open-source models under 50B parameters has been reported to be close to zero due to the low-resource nature of these languages. We significantly improve their performance via a series of lightweight pipelines that integrates documentation retrieval with modular code generation and error refinement. With models like Llama-4-Maverick, our best pipeline improves plan correctness from 0\% to over 80\% on the common BlocksWorld domain. However, while syntactic errors are substantially reduced, semantic errors persist in more challenging domains, revealing fundamental limitations in current models' reasoning capabilities.\footnote{Our code and data can be found at https://github.com/Nangxxxxx/PDDL-RAG
中文摘要:通过结合文档检索、模块化代码生成和错误修正的轻量级流程,我们显著提升了50B参数以下开源模型在零样本形式化规划中的表现,在BlocksWorld领域将规划正确率从0%提高至80%以上,但语义错误仍暴露出现有模型的根本推理局限。
English Summary: Lightweight pipelines integrating documentation retrieval, modular code generation, and error refinement significantly boost the performance of sub-50B parameter LLMs in zero-shot formal planning, increasing plan correctness from 0% to over 80% in BlocksWorld while revealing persistent semantic reasoning limitations.

Authors:Binbin Zhang, Chengdong Liang, Shuai Wang, Xuelong Geng, Zhao Guo, Haoyu Li, Hao Yin, Xipeng Yang, Pengshen Zhang, Changwei Ma, Lei Xie
Title: WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
Abstract:
In this paper, we present WEST(WE Speech Toolkit), a speech toolkit based on a large language model (LLM) for speech understanding, generation, and interaction. There are three key features of WEST: 1) Fully LLM-based: Standing on the shoulders of giants by reusing mature architectures, ecosystems (e.g., Hugging Face), and methods (e.g., sequence packing) from large models. 2) Full-stack: Supports tasks such as recognition, synthesis, understanding, dialogue, and multimodal capabilities, with extensibility to incorporate open-source models. 3) Simple and Stupid: A simple and stupid speech toolkit that everyone can Touch. In addition, WEST provides two types of recipes, models, and experimental results. The first is entirely based on open-source models and open-source data, allowing users to fully reproduce the experiments in this paper and serving as a verification system or minimal system baseline. The second is trained on massive data, offering superior performance so the user can directly apply it out of the box. WEST is publicly avilable at https://github.com/wenet-e2e/west/
中文: 本文介绍了WEST,一个基于大语言模型的语音工具包,支持语音理解、生成和交互,具备全LLM集成、全栈任务支持和简易设计,并提供开源和高性能版本供用户使用。
English: This paper introduces WEST, a speech toolkit built on a large language model that supports speech understanding, generation, and interaction, featuring full LLM integration, comprehensive task support, and user-friendly design, with open-source and high-performance versions available.

Authors:Xueliang Zhao, Wei Wu, Jian Guan, Zhuocheng Gong, Lingpeng Kong
Title: PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning
Abstract:
Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
Chinese: PromptCoT 2.0 提出了一个可扩展的框架,通过期望最大化循环生成更困难且更多样化的训练问题,在推理任务的自博弈和监督微调中取得了最先进的成果。
English: PromptCoT 2.0 introduces a scalable framework using an expectation-maximization loop to generate harder and more diverse training problems, achieving state-of-the-art results in self-play and supervised fine-tuning for reasoning tasks.

Authors:Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong sun
Title: BurstEngine: an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over 1M Tokens
Abstract:
Existing methods for training LLMs on long-sequence data, such as Tensor Parallelism and Context Parallelism, exhibit low Model FLOPs Utilization as sequence lengths and number of GPUs increase, especially when sequence lengths exceed 1M tokens. To address these challenges, we propose BurstEngine, an efficient framework designed to train LLMs on long-sequence data. BurstEngine introduces BurstAttention, an optimized distributed attention with lower communication cost than RingAttention. BurstAttention leverages topology-aware ring communication to fully utilize network bandwidth and incorporates fine-grained communication-computation overlap. Furthermore, BurstEngine introduces sequence-level selective checkpointing and fuses the language modeling head with the loss function to reduce memory cost. Additionally, BurstEngine introduces workload balance optimization for various types of attention masking. By integrating these optimizations, BurstEngine achieves a $1.2\times$ speedup with much lower memory overhead than the state-of-the-art baselines when training LLMs on extremely long sequences of over 1M tokens. We have made our code publicly available on GitHub: https://github.com/thunlp/BurstEngine.
中文摘要:BurstEngine是一个创新框架,通过引入通信成本更低的BurstAttention注意力机制及多项内存优化技术,在超长序列训练中实现了比现有方法快1.2倍的速度提升和更低的内存开销。
English Summary: BurstEngine is a novel framework that enhances LLM training on long sequences by introducing BurstAttention for efficient distributed processing and multiple optimizations, achieving 1.2x speedup with reduced memory overhead for sequences over 1M tokens.

Authors:Sen Yang, Yu Bao, Yu Lu, Jiajun Chen, Shujian Huang, Shanbo Cheng
Title: EnAnchored-X2X: English-Anchored Optimization for Many-to-Many Translation
Abstract:
Large language models (LLMs) have demonstrated strong machine translation capabilities for English-centric language pairs but underperform in direct non-English (x2x) translation. This work addresses this limitation through a synthetic data generation framework that leverages models' established English-to-x (en2x) capabilities. By extending English parallel corpora into omnidirectional datasets and developing an English-referenced quality evaluation proxy, we enable effective collection of high-quality x2x training data. Combined with preference-based optimization, our method achieves significant improvement across 72 x2x directions for widely used LLMs, while generalizing to enhance en2x performance. The results demonstrate that strategic exploitation of English-centric strengths can bootstrap comprehensive multilingual translation capabilities in LLMs. We release codes, datasets, and model checkpoints at https://github.com/NJUNLP/EAX
中文摘要:本研究提出一种合成数据生成框架,利用大语言模型的英语中心优势克服其在非英语直译中的局限,在72个语言方向上实现显著提升,同时增强整体多语言翻译能力。
English Summary: This study introduces a synthetic data generation framework that leverages LLMs' English-centric strengths to overcome their limitations in direct non-English translation, achieving significant improvements across 72 language directions while enhancing overall multilingual capabilities.

Authors:Xiangyang Chen, Shuzhao Li, Xiuwen Zhu, Yongfan Chen, Fan Yang, Cheng Fang, Lin Qu, Xiaoxiao Xu, Hu Wei, Minggang Wu
Title: Logics-Parsing Technical Report
Abstract:
Recent advances in Large Vision-Language models (LVLM) have spurred significant progress in document parsing task. Compared to traditional pipeline-based methods, end-to-end paradigms have shown their excellence in converting PDF images into structured outputs through integrated Optical Character Recognition (OCR), table recognition, mathematical formula recognition and so on. However, the absence of explicit analytical stages for document layouts and reading orders limits the LVLM's capability in handling complex document types such as multi-column newspapers or posters. To address this limitation, we propose in this report Logics-Parsing: an end-to-end LVLM-based model augmented with reinforcement learning. Our model incorporates meticulously designed reward mechanisms to optimize complex layout analysis and reading order inference. In addition, we expand the model's versatility by incorporating diverse data types such as chemical formulas and handwritten Chinese characters into supervised fine-tuning. Finally, to enable rigorous evaluation of our approach, we introduce LogicsParsingBench, a curated set of 1,078 page-level PDF images spanning nine major categories and over twenty sub-categories, which will be released later. Comprehensive experiments conducted on LogicsParsingBench have validated the efficacy and State-of-the-art (SOTA) performance of our proposed model across diverse document analysis scenarios. Project Page: https://github.com/alibaba/Logics-Parsing
中文摘要:本报告提出Logics-Parsing模型,通过强化学习优化布局分析和阅读顺序处理,在包含九大类文档的基准测试中展现出卓越性能,有效解决了复杂文档解析的挑战。
English Summary: This report introduces Logics-Parsing, an enhanced end-to-end Large Vision-Language model that integrates reinforcement learning with specialized reward mechanisms to improve layout analysis and reading order processing for complex documents, validated through a comprehensive benchmark showing state-of-the-art performance.

Authors:Jinhui Zheng, Xueyuan Gong
Title: ExpFace: Exponential Angular Margin Loss for Deep Face Recognition
Abstract:
Face recognition is an open-set problem requiring high discriminative power to ensure that intra-class distances remain smaller than inter-class distances. Margin-based softmax losses, such as SphereFace, CosFace, and ArcFace, have been widely adopted to enhance intra-class compactness and inter-class separability, yet they overlook the impact of noisy samples. By examining the distribution of samples in the angular space, we observe that clean samples predominantly cluster in the center region, whereas noisy samples tend to shift toward the peripheral region. Motivated by this observation, we propose the Exponential Angular Margin Loss (ExpFace), which introduces an angular exponential term as the margin. This design applies a larger penalty in the center region and a smaller penalty in the peripheral region within the angular space, thereby emphasizing clean samples while suppressing noisy samples. We present a unified analysis of ExpFace and classical margin-based softmax losses in terms of margin embedding forms, similarity curves, and gradient curves, showing that ExpFace not only avoids the training instability of SphereFace and the non-monotonicity of ArcFace, but also exhibits a similarity curve that applies penalties in the same manner as the decision boundary in the angular space. Extensive experiments demonstrate that ExpFace achieves state-of-the-art performance. To facilitate future research, we have released the source code at: https://github.com/dfr-code/ExpFace.
中文: 提出的指数化角度间隔损失(ExpFace)通过在角度空间中加大对中心区域干净样本的惩罚、减小对边缘噪声样本的惩罚,有效提升了人脸识别的判别能力,在克服现有方法缺陷的同时实现了最优性能。
English: The proposed Exponential Angular Margin Loss (ExpFace) enhances face recognition by applying larger penalties to centrally clustered clean samples and smaller penalties to peripheral noisy samples in angular space, achieving state-of-the-art performance while addressing limitations of previous methods.

Authors:Yi Yang
Title: nnFilterMatch: A Unified Semi-Supervised Learning Framework with Uncertainty-Aware Pseudo-Label Filtering for Efficient Medical Segmentation
Abstract:
Semi-supervised learning (SSL) has emerged as a promising paradigm in medical image segmentation, offering competitive performance while substantially reducing the need for extensive manual annotation. When combined with active learning (AL), these strategies further minimize annotation burden by selectively incorporating the most informative samples. However, conventional SSL_AL hybrid approaches often rely on iterative and loop-based retraining cycles after each annotation round, incurring significant computational overhead and limiting scalability in clinical applications. In this study, we present a novel, annotation-efficient, and self-adaptive deep segmentation framework that integrates SSL with entropy-based pseudo-label filtering (FilterMatch), an AL-inspired mechanism, within the single-pass nnU-Net training segmentation framework (nnFilterMatch). By selectively excluding high-confidence pseudo-labels during training, our method circumvents the need for retraining loops while preserving the benefits of uncertainty-guided learning. We validate the proposed framework across multiple clinical segmentation benchmarks and demonstrate that it achieves performance comparable to or exceeding fully supervised models, even with only 5\%--20\% labeled data. This work introduces a scalable, end-to-end learning strategy for reducing annotation demands in medical image segmentation without compromising accuracy. Code is available here: https://github.com/Ordi117/nnFilterMatch.git.
中文: 本研究提出nnFilterMatch框架,将主动学习启发的伪标签过滤机制融入半监督学习,仅需5%-20%标注数据即可获得与全监督模型相当的医学图像分割效果,同时避免了传统方法的重训练循环。
English: This study introduces nnFilterMatch, a novel semi-supervised learning framework that integrates active learning-inspired pseudo-label filtering to achieve medical image segmentation performance comparable to fully supervised models using only 5%-20% labeled data, eliminating the need for computational retraining loops.

Authors:Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li
Title: HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST
Abstract:
Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at https://github.com/carsonz/HiCoLoRA.
中文摘要:HiCoLoRA通过分层LoRA架构、谱聚类联合域-槽识别和语义增强初始化,解决了零样本对话状态跟踪中的语义对齐难题,在MultiWOZ和SGD数据集上实现了最优性能。
English Summary: HiCoLoRA introduces a hierarchical LoRA framework with spectral clustering and semantic-enhanced initialization to address semantic misalignment in zero-shot dialog state tracking, achieving state-of-the-art performance on MultiWOZ and SGD datasets.

Authors:Shuyu Zhang, Yifan Wei, Xinru Wang, Yanmin Zhu, Yangfan He, Yixuan Weng, Bin Li
Title: HiCoLoRA: Addressing Context-Prompt Misalignment via Hierarchical Collaborative LoRA for Zero-Shot DST
Abstract:
Zero-shot Dialog State Tracking (zs-DST) is essential for enabling Task-Oriented Dialog Systems (TODs) to generalize to new domains without costly data annotation. A central challenge lies in the semantic misalignment between dynamic dialog contexts and static prompts, leading to inflexible cross-layer coordination, domain interference, and catastrophic forgetting. To tackle this, we propose Hierarchical Collaborative Low-Rank Adaptation (HiCoLoRA), a framework that enhances zero-shot slot inference through robust prompt alignment. It features a hierarchical LoRA architecture for dynamic layer-specific processing (combining lower-layer heuristic grouping and higher-layer full interaction), integrates Spectral Joint Domain-Slot Clustering to identify transferable associations (feeding an Adaptive Linear Fusion Mechanism), and employs Semantic-Enhanced SVD Initialization (SemSVD-Init) to preserve pre-trained knowledge. Experiments on multi-domain datasets MultiWOZ and SGD show that HiCoLoRA outperforms baselines, achieving SOTA in zs-DST. Code is available at https://github.com/carsonz/HiCoLoRA.
中文摘要:HiCoLoRA通过分层LoRA架构、谱聚类联合域-槽识别和语义增强初始化,解决了零样本对话状态跟踪中的语义对齐难题,在MultiWOZ和SGD数据集上实现了最优性能。
English Summary: HiCoLoRA introduces a hierarchical LoRA framework with spectral clustering and semantic-enhanced initialization to address semantic misalignment in zero-shot dialog state tracking, achieving state-of-the-art performance on MultiWOZ and SGD datasets.

Authors:Jiesi Hu, Yanwu Yang, Zhiyu Ye, Chenfei Ye, Hanyang Peng, Jianfeng Cao, Ting Ma
Title: Towards Robust In-Context Learning for Medical Image Segmentation via Data Synthesis
Abstract:
The rise of In-Context Learning (ICL) for universal medical image segmentation has introduced an unprecedented demand for large-scale, diverse datasets for training, exacerbating the long-standing problem of data scarcity. While data synthesis offers a promising solution, existing methods often fail to simultaneously achieve both high data diversity and a domain distribution suitable for medical data. To bridge this gap, we propose \textbf{SynthICL}, a novel data synthesis framework built upon domain randomization. SynthICL ensures realism by leveraging anatomical priors from real-world datasets, generates diverse anatomical structures to cover a broad data distribution, and explicitly models inter-subject variations to create data cohorts suitable for ICL. Extensive experiments on four held-out datasets validate our framework's effectiveness, showing that models trained with our data achieve performance gains of up to 63\% in average Dice and substantially enhanced generalization to unseen anatomical domains. Our work helps mitigate the data bottleneck for ICL-based segmentation, paving the way for robust models. Our code and the generated dataset are publicly available at https://github.com/jiesihu/Neuroverse3D.
中文摘要:提出的SynthICL框架通过领域随机化生成解剖学真实且多样化的合成数据,有效解决了通用医学图像分割中的数据稀缺问题,在四个数据集上实现了高达63%的性能提升并显著增强了泛化能力。
English Summary: The proposed SynthICL framework overcomes data scarcity in universal medical image segmentation by generating anatomically realistic and diverse synthetic data through domain randomization, achieving up to 63% performance improvement and enhanced generalization across four datasets.

Authors:J. Ben Tamo, Nishant S. Chouhan, Micky C. Nnamdi, Yining Yuan, Shreya S. Chivilkar, Wenqi Shi, Steven W. Hwang, B. Randall Brenn, May D. Wang
Title: Causal Machine Learning for Surgical Interventions
Abstract:
Surgical decision-making is complex and requires understanding causal relationships between patient characteristics, interventions, and outcomes. In high-stakes settings like spinal fusion or scoliosis correction, accurate estimation of individualized treatment effects (ITEs) remains limited due to the reliance on traditional statistical methods that struggle with complex, heterogeneous data. In this study, we develop a multi-task meta-learning framework, X-MultiTask, for ITE estimation that models each surgical decision (e.g., anterior vs. posterior approach, surgery vs. no surgery) as a distinct task while learning shared representations across tasks. To strengthen causal validity, we incorporate the inverse probability weighting (IPW) into the training objective. We evaluate our approach on two datasets: (1) a public spinal fusion dataset (1,017 patients) to assess the effect of anterior vs. posterior approaches on complication severity; and (2) a private AIS dataset (368 patients) to analyze the impact of posterior spinal fusion (PSF) vs. non-surgical management on patient-reported outcomes (PROs). Our model achieves the highest average AUC (0.84) in the anterior group and maintains competitive performance in the posterior group (0.77). It outperforms baselines in treatment effect estimation with the lowest overall $ε_{\text{NN-PEHE}}$ (0.2778) and $ε_{\text{ATE}}$ (0.0763). Similarly, when predicting PROs in AIS, X-MultiTask consistently shows superior performance across all domains, with $ε_{\text{NN-PEHE}}$ = 0.2551 and $ε_{\text{ATE}}$ = 0.0902. By providing robust, patient-specific causal estimates, X-MultiTask offers a powerful tool to advance personalized surgical care and improve patient outcomes. The code is available at https://github.com/Wizaaard/X-MultiTask.
Chinese: 本研究提出X-MultiTask多任务元学习框架,通过整合逆概率加权改进手术决策中的个体化治疗效果评估,在脊柱融合术和青少年特发性脊柱侧凸的预后预测中展现出优于基准方法的性能。
English: The study introduces X-MultiTask, a multi-task meta-learning framework that enhances individualized treatment effect estimation in surgical decisions by incorporating inverse probability weighting, demonstrating superior performance in predicting outcomes for spinal fusion and adolescent idiopathic scoliosis compared to baseline methods.

Authors:Shuyu Zhang, Yifan Wei, Jialuo Yuan, Xinru Wang, Yanmin Zhu, Bin Li
Title: DyBBT: Dynamic Balance via Bandit inspired Targeting for Dialog Policy with Cognitive Dual-Systems
Abstract:
Task oriented dialog systems often rely on static exploration strategies that do not adapt to dynamic dialog contexts, leading to inefficient exploration and suboptimal performance. We propose DyBBT, a novel dialog policy learning framework that formalizes the exploration challenge through a structured cognitive state space capturing dialog progression, user uncertainty, and slot dependency. DyBBT proposes a bandit inspired meta-controller that dynamically switches between a fast intuitive inference (System 1) and a slow deliberative reasoner (System 2) based on real-time cognitive states and visitation counts. Extensive experiments on single- and multi-domain benchmarks show that DyBBT achieves state-of-the-art performance in success rate, efficiency, and generalization, with human evaluations confirming its decisions are well aligned with expert judgment. Code is available at https://github.com/carsonz/DyBBT.
中文摘要:DyBBT提出了一种动态对话策略框架,通过认知状态空间和双系统元控制器实现自适应探索,从而取得了最优性能表现。
English Summary: DyBBT introduces a dynamic dialog policy framework using a cognitive state space and dual-system meta-controller to achieve state-of-the-art performance through adaptive exploration.

Authors:Ling Lo, Kelvin C. K. Chan, Wen-Huang Cheng, Ming-Hsuan Yang
Title: From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition
Abstract:
Existing models often struggle with complex temporal changes, particularly when generating videos with gradual attribute transitions. The most common prompt interpolation approach for motion transitions often fails to handle gradual attribute transitions, where inconsistencies tend to become more pronounced. In this work, we propose a simple yet effective method to extend existing models for smooth and consistent attribute transitions, through introducing frame-wise guidance during the denoising process. Our approach constructs a data-specific transitional direction for each noisy latent, guiding the gradual shift from initial to final attributes frame by frame while preserving the motion dynamics of the video. Moreover, we present the Controlled-Attribute-Transition Benchmark (CAT-Bench), which integrates both attribute and motion dynamics, to comprehensively evaluate the performance of different models. We further propose two metrics to assess the accuracy and smoothness of attribute transitions. Experimental results demonstrate that our approach performs favorably against existing baselines, achieving visual fidelity, maintaining alignment with text prompts, and delivering seamless attribute transitions. Code and CATBench are released: https://github.com/lynn-ling-lo/Prompt2Progression.
Chinese: 本文提出了一种在去噪过程中引入逐帧引导的方法,增强了视频生成模型处理属性渐变的能力,同时保持运动动态,并通过新基准和指标验证了其优越性能。
English: This paper introduces a method that enhances video generation models by incorporating frame-wise guidance during denoising, enabling smooth attribute transitions while preserving motion dynamics, and validates its effectiveness with a new benchmark and metrics.

Authors:Kunlun Xu, Yibo Feng, Jiangmeng Li, Yongsheng Qi, Jiahuan Zhou
Title: C${}^2$Prompt: Class-aware Client Knowledge Interaction for Federated Continual Learning
Abstract:
Federated continual learning (FCL) tackles scenarios of learning from continuously emerging task data across distributed clients, where the key challenge lies in addressing both temporal forgetting over time and spatial forgetting simultaneously. Recently, prompt-based FCL methods have shown advanced performance through task-wise prompt communication.In this study, we underscore that the existing prompt-based FCL methods are prone to class-wise knowledge coherence between prompts across clients. The class-wise knowledge coherence includes two aspects: (1) intra-class distribution gap across clients, which degrades the learned semantics across prompts, (2) inter-prompt class-wise relevance, which highlights cross-class knowledge confusion. During prompt communication, insufficient class-wise coherence exacerbates knowledge conflicts among new prompts and induces interference with old prompts, intensifying both spatial and temporal forgetting. To address these issues, we propose a novel Class-aware Client Knowledge Interaction (C${}^2$Prompt) method that explicitly enhances class-wise knowledge coherence during prompt communication. Specifically, a local class distribution compensation mechanism (LCDC) is introduced to reduce intra-class distribution disparities across clients, thereby reinforcing intra-class knowledge consistency. Additionally, a class-aware prompt aggregation scheme (CPA) is designed to alleviate inter-class knowledge confusion by selectively strengthening class-relevant knowledge aggregation. Extensive experiments on multiple FCL benchmarks demonstrate that C${}^2$Prompt achieves state-of-the-art performance. Our source code is available at https://github.com/zhoujiahuan1991/NeurIPS2025-C2Prompt
中文摘要:本研究针对基于提示的联邦持续学习中存在的类间知识一致性问题,提出了C²Prompt方法,通过增强类内一致性和减少类间混淆来缓解空间和时间遗忘。
English Summary: This study addresses class-wise knowledge coherence issues in prompt-based federated continual learning by proposing the C²Prompt method, which enhances intra-class consistency and reduces inter-class confusion to mitigate both spatial and temporal forgetting.

Authors:Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone
Title: RoboSSM: Scalable In-context Imitation Learning via State-Space Models
Abstract:
In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.
中文:RoboSSM提出了一种基于状态空间模型的可扩展上下文模仿学习方法,通过高效处理长上下文并在新任务上表现稳健,超越了基于Transformer的方法。
English: RoboSSM introduces a scalable in-context imitation learning approach using state-space models, outperforming Transformer-based methods with efficient long-context handling and robust performance on novel tasks.

Authors:Yijun Yuan
Title: Formalization of Harder-Narasimhan theory
Abstract:
The Harder-Narasimhan theory provides a canonical filtration of a vector bundle on a projective curve whose successive quotients are semistable with strictly decreasing slopes. In this article, we present the formalization of Harder-Narasimhan theory in the proof assistant Lean 4 with Mathlib. This formalization is based on a recent approach of Harder-Narasimhan theory by Chen and Jeannin, which reinterprets the theory in order-theoretic terms and avoids the classical dependence on algebraic geometry. As an application, we formalize the uniqueness of coprimary filtration of a finitely generated module over a noetherian ring, and the existence of the Jordan-Hölder filtration of a semistable Harder-Narasimhan game. Code available at: https://github.com/YijunYuan/HarderNarasimhan
中文: 本文基于Chen和Jeannin提出的序理论方法,在Lean 4中利用Mathlib实现了Harder-Narasimhan理论的形式化,该方法避免了传统代数几何的依赖关系。
English: This article presents the formalization of Harder-Narasimhan theory in Lean 4 using Mathlib, based on Chen and Jeannin's order-theoretic approach that avoids classical algebraic geometry dependencies.

Authors:Juan Manuel Perez, Kevin Garcia, Brooklyn Berry, Dongjin Song, Yifeng Gao
Title: Adaptive von Mises-Fisher Likelihood Loss for Supervised Deep Time Series Hashing
Abstract:
Indexing time series by creating compact binary representations is a fundamental task in time series data mining. Recently, deep learning-based hashing methods have proven effective for indexing time series based on semantic meaning rather than just raw similarity. The purpose of deep hashing is to map samples with the same semantic meaning to identical binary hash codes, enabling more efficient search and retrieval. Unlike other supervised representation learning methods, supervised deep hashing requires a discretization step to convert real-valued representations into binary codes, but this can induce significant information loss. In this paper, we propose a von Mises-Fisher (vMF) hashing loss. The proposed deep hashing model maps data to an M-dimensional hyperspherical space to effectively reduce information loss and models each data class as points following distinct vMF distributions. The designed loss aims to maximize the separation between each modeled vMF distribution to provide a better way to maximize the margin between each semantically different data sample. Experimental results show that our method outperforms existing baselines. The implementation is publicly available at https://github.com/jmpq97/vmf-hashing
中文摘要:本文提出了一种冯·米塞斯-费希尔哈希方法,将时间序列数据映射到超球面空间以减少二进制编码过程中的信息损失,实验证明其性能优于现有基准方法。
English Summary: This paper introduces a von Mises-Fisher hashing method that maps time series data to a hyperspherical space to minimize information loss during binary encoding, demonstrating superior performance over existing approaches.

Authors:Rohan Surana, Amit Namburi, Gagan Mundada, Abhay Lal, Zachary Novack, Julian McAuley, Junda Wu
Title: MusiCRS: Benchmarking Audio-Centric Conversational Recommendation
Abstract:
Conversational recommendation has advanced rapidly with large language models (LLMs), yet music remains a uniquely challenging domain where effective recommendations require reasoning over audio content beyond what text or metadata can capture. We present MusiCRS, the first benchmark for audio-centric conversational recommendation that links authentic user conversations from Reddit with corresponding audio tracks. MusiCRS contains 477 high-quality conversations spanning diverse genres (classical, hip-hop, electronic, metal, pop, indie, jazz) with 3,589 unique musical entities and audio grounding via YouTube links. MusiCRS enables evaluation across three input modality configurations: audio-only, query-only, and audio+query (multimodal), allowing systematic comparison of audio-LLMs, retrieval models, and traditional approaches. Our experiments reveal that current systems rely heavily on textual signals and struggle with nuanced audio reasoning. This exposes fundamental limitations in cross-modal knowledge integration where models excel at dialogue semantics but cannot effectively ground abstract musical concepts in actual audio content. To facilitate progress, we release the MusiCRS dataset (https://huggingface.co/datasets/rohan2810/MusiCRS), evaluation code (https://github.com/rohan2810/musiCRS), and comprehensive baselines.
中文摘要:MusiCRS是首个以音频为中心的对话推荐基准,将真实Reddit对话与音乐曲目相连接,揭示了当前系统虽擅长文本处理但在音频推理方面存在明显不足。
English Summary: MusiCRS is the first audio-focused conversational recommendation benchmark that connects real Reddit discussions with music tracks, revealing current systems' limitations in audio reasoning despite strong text processing capabilities.

Authors:Yifan Ye, Jun Cen, Jing Chen, Zhihe Lu
Title: Self-evolved Imitation Learning in Simulated World
Abstract:
Imitation learning has been a trend recently, yet training a generalist agent across multiple tasks still requires large-scale expert demonstrations, which are costly and labor-intensive to collect. To address the challenge of limited supervision, we propose Self-Evolved Imitation Learning (SEIL), a framework that progressively improves a few-shot model through simulator interactions. The model first attempts tasksin the simulator, from which successful trajectories are collected as new demonstrations for iterative refinement. To enhance the diversity of these demonstrations, SEIL employs dual-level augmentation: (i) Model-level, using an Exponential Moving Average (EMA) model to collaborate with the primary model, and (ii) Environment-level, introducing slight variations in initial object positions. We further introduce a lightweight selector that filters complementary and informative trajectories from the generated pool to ensure demonstration quality. These curated samples enable the model to achieve competitive performance with far fewer training examples. Extensive experiments on the LIBERO benchmark show that SEIL achieves a new state-of-the-art performance in few-shot imitation learning scenarios. Code is available at https://github.com/Jasper-aaa/SEIL.git.
中文: SEIL是一种自演进的模仿学习框架,通过模拟器交互、双层级增强和轨迹筛选,在少量专家示范下显著提升模型性能,实现了最先进的少样本学习效果。
English: SEIL is a self-evolved imitation learning framework that enhances few-shot model performance through simulator interactions, dual-level augmentation, and trajectory selection, achieving state-of-the-art results with minimal expert demonstrations.

Authors:Sahil Tyagi, Andrei Cozma, Olivera Kotevska, Feiyi Wang
Title: OmniFed: A Modular Framework for Configurable Federated Learning from Edge to HPC
Abstract:
Federated Learning (FL) is critical for edge and High Performance Computing (HPC) where data is not centralized and privacy is crucial. We present OmniFed, a modular framework designed around decoupling and clear separation of concerns for configuration, orchestration, communication, and training logic. Its architecture supports configuration-driven prototyping and code-level override-what-you-need customization. We also support different topologies, mixed communication protocols within a single deployment, and popular training algorithms. It also offers optional privacy mechanisms including Differential Privacy (DP), Homomorphic Encryption (HE), and Secure Aggregation (SA), as well as compression strategies. These capabilities are exposed through well-defined extension points, allowing users to customize topology and orchestration, learning logic, and privacy/compression plugins, all while preserving the integrity of the core system. We evaluate multiple models and algorithms to measure various performance metrics. By unifying topology configuration, mixed-protocol communication, and pluggable modules in one stack, OmniFed streamlines FL deployment across heterogeneous environments. Github repository is available at https://github.com/at-aaims/OmniFed.
中文: OmniFed是一个模块化的联邦学习框架,通过可插拔架构支持灵活配置、多种拓扑结构和隐私保护机制,简化了异构环境中的部署流程。
English: OmniFed is a modular federated learning framework that enables flexible configuration, supports diverse topologies and privacy mechanisms, and streamlines deployment across heterogeneous environments through its pluggable architecture.

Authors:Axel Marmoret, Reda Bensaid, Jonathan Lys, Vincent Gripon, François Leduc-Primeau
Title: TensLoRA: Tensor Alternatives for Low-Rank Adaptation
Abstract:
Low-Rank Adaptation (LoRA) is widely used to efficiently adapt Transformers by adding trainable low-rank matrices to attention projections. While effective, these matrices are considered independent for each attention projection (Query, Key, and Value) and each layer. Recent extensions have considered joint, tensor-based adaptations, but only in limited forms and without a systematic framework. We introduce TensLoRA, a unified framework that aggregates LoRA updates into higher-order tensors and models a broad family of tensor-based low-rank adaptations. Our formulation generalizes existing tensor-based methods and enables mode-specific compression rates, allowing parameter budgets to be tailored according to the modality and task. Experiments on vision and language benchmarks reveal that the tensor construction directly impacts performance, sometimes better than standard LoRA under similar parameter counts.
中文: TensLoRA 提出了一个统一框架,将 LoRA 更新聚合为高阶张量,支持模态特定的压缩率,在相似参数限制下某些情况下性能优于标准 LoRA。
English: TensLoRA introduces a unified framework that aggregates LoRA updates into higher-order tensors, enabling mode-specific compression rates and outperforming standard LoRA in some cases under similar parameter constraints.

Authors:Zhijin Guo, Chenhao Xue, Zhaozhen Xu, Hongbo Bo, Yuxuan Ye, Janet B. Pierrehumbert, Martha Lewis
Title: Quantifying Compositionality of Classic and State-of-the-Art Embeddings
Abstract:
For language models to generalize correctly to novel expressions, it is critical that they exploit access compositional meanings when this is justified. Even if we don't know what a "pelp" is, we can use our knowledge of numbers to understand that "ten pelps" makes more pelps than "two pelps". Static word embeddings such as Word2vec made strong, indeed excessive, claims about compositionality. The SOTA generative, transformer models and graph models, however, go too far in the other direction by providing no real limits on shifts in meaning due to context. To quantify the additive compositionality, we formalize a two-step, generalized evaluation that (i) measures the linearity between known entity attributes and their embeddings via canonical correlation analysis, and (ii) evaluates additive generalization by reconstructing embeddings for unseen attribute combinations and checking reconstruction metrics such as L2 loss, cosine similarity, and retrieval accuracy. These metrics also capture failure cases where linear composition breaks down. Sentences, knowledge graphs, and word embeddings are evaluated and tracked the compositionality across all layers and training stages. Stronger compositional signals are observed in later training stages across data modalities, and in deeper layers of the transformer-based model before a decline at the top layer. Code is available at https://github.com/Zhijin-Guo1/quantifying-compositionality.
Chinese: 本研究提出了一种量化语言模型加法组合性的两步评估方法,发现在后期训练阶段和深层网络中存在更强的组合性信号,但在顶层出现下降。
English: This study introduces a two-step evaluation method to quantify additive compositionality in language models, revealing stronger compositional signals in later training stages and deeper layers before a decline at the top layer.

Authors:Enhao Huang, Zhiyu Zhang, Tianxiang Xu, Chunshu Xia, Kaichun Hu, Yuchen Yang, Tongtong Pan, Dong Dong, Zhan Qin
Title: Holographic Transformers for Complex-Valued Signal Processing: Integrating Phase Interference into Self-Attention
Abstract:
Complex-valued signals encode both amplitude and phase, yet most deep models treat attention as real-valued correlation, overlooking interference effects. We introduce the Holographic Transformer, a physics-inspired architecture that incorporates wave interference principles into self-attention. Holographic attention modulates interactions by relative phase and coherently superimposes values, ensuring consistency between amplitude and phase. A dual-headed decoder simultaneously reconstructs the input and predicts task outputs, preventing phase collapse when losses prioritize magnitude over phase. We demonstrate that holographic attention implements a discrete interference operator and maintains phase consistency under linear mixing. Experiments on PolSAR image classification and wireless channel prediction show strong performance, achieving high classification accuracy and F1 scores, low regression error, and increased robustness to phase perturbations. These results highlight that enforcing physical consistency in attention leads to generalizable improvements in complex-valued learning and provides a unified, physics-based framework for coherent signal modeling. The code is available at https://github.com/EonHao/Holographic-Transformers.
中文摘要:全息变换器将波动干涉原理引入自注意力机制,确保复值信号中幅度与相位的一致性,在极化SAR图像分类和无线信道预测等任务中展现出卓越的鲁棒性和准确性。
English Summary: The Holographic Transformer integrates wave interference principles into self-attention to maintain phase consistency in complex-valued signals, demonstrating superior performance in tasks like PolSAR classification and wireless prediction through enhanced robustness and accuracy.

Authors:Ruochi Li, Haoxuan Zhang, Edward Gehringer, Ting Xiao, Junhua Ding, Haihua Chen
Title: Unveiling the Merits and Defects of LLMs in Automatic Review Generation for Scientific Papers
Abstract:
The surge in scientific submissions has placed increasing strain on the traditional peer-review process, prompting the exploration of large language models (LLMs) for automated review generation. While LLMs demonstrate competence in producing structured and coherent feedback, their capacity for critical reasoning, contextual grounding, and quality sensitivity remains limited. To systematically evaluate these aspects, we propose a comprehensive evaluation framework that integrates semantic similarity analysis and structured knowledge graph metrics to assess LLM-generated reviews against human-written counterparts. We construct a large-scale benchmark of 1,683 papers and 6,495 expert reviews from ICLR and NeurIPS in multiple years, and generate reviews using five LLMs. Our findings show that LLMs perform well in descriptive and affirmational content, capturing the main contributions and methodologies of the original work, with GPT-4o highlighted as an illustrative example, generating 15.74% more entities than human reviewers in the strengths section of good papers in ICLR 2025. However, they consistently underperform in identifying weaknesses, raising substantive questions, and adjusting feedback based on paper quality. GPT-4o produces 59.42% fewer entities than real reviewers in the weaknesses and increases node count by only 5.7% from good to weak papers, compared to 50% in human reviews. Similar trends are observed across all conferences, years, and models, providing empirical foundations for understanding the merits and defects of LLM-generated reviews and informing the development of future LLM-assisted reviewing tools. Data, code, and more detailed results are publicly available at https://github.com/RichardLRC/Peer-Review.
中文: 该研究评估了大语言模型在自动同行评审中的应用,发现其虽能有效总结论文优点,但在批判性分析和根据论文质量调整反馈方面表现不足,这一结论基于对大量学术论文和评审的大规模基准测试得出。
English: The study evaluates large language models (LLMs) for automated peer review, finding they excel in summarizing strengths but struggle with critical analysis and adapting feedback to paper quality, as demonstrated through a comprehensive benchmark of academic papers and reviews.

Authors:Millie Vyas, Timothy Blattner, Alden Dima
Title: Readme_AI: Dynamic Context Construction for Large Language Models
Abstract:
Despite being trained on significant amounts of data, Large Language Models (LLMs) can provide inaccurate or unreliable information in the context of a user's specific query. Given query-specific context significantly improves the usefulness of its responses. In this paper, we present a specification that can be used to dynamically build context for data sources. The data source owner creates the file containing metadata for LLMs to use when reasoning about dataset-related queries. To demonstrate our proposed specification, we created a prototype Readme_AI Model Context Protocol (MCP) server that retrieves the metadata from the data source and uses it to dynamically build context. Some features that make this specification dynamic are the extensible types that represent crawling web-pages, fetching data from data repositories, downloading and parsing publications, and general text. The context is formatted and grouped using user-specified tags that provide clear contextual information for the LLM to reason about the content. We demonstrate the capabilities of this early prototype by asking the LLM about the NIST-developed Hedgehog library, for which common LLMs often provides inaccurate and irrelevant responses containing hallucinations. With Readme_AI, the LLM receives enough context that it is now able to reason about the library and its use, and even generate code interpolated from examples that were included in the Readme_AI file provided by Hedgehog's developer. Our primary contribution is a extensible protocol for dynamically grounding LLMs in specialized, owner-provided data, enhancing responses from LLMs and reducing hallucinations. The source code for the Readme_AI tool is posted here: https://github.com/usnistgov/readme_ai .
中文摘要:本文提出的Readme_AI协议能动态构建数据源上下文,通过让LLMs获取结构化元数据来显著提升回答准确性并减少幻觉现象。
English Summary: This paper introduces Readme_AI, a dynamic protocol that enables LLMs to access structured metadata from data sources, significantly improving response accuracy and reducing hallucinations by providing query-specific context.

Authors:Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Alistair Johnson, Edward Choi, Yugang jia, Jong Ha Lee
Title: FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
Abstract:
The recent shift toward the Health Level Seven Fast Healthcare Interoperability Resources (HL7 FHIR) standard opens a new frontier for clinical AI, demanding LLM agents to navigate complex, resource-based data models instead of conventional structured health data. However, existing benchmarks have lagged behind this transition, lacking the realism needed to evaluate recent LLMs on interoperable clinical data. To bridge this gap, we introduce FHIR-AgentBench, a benchmark that grounds 2,931 real-world clinical questions in the HL7 FHIR standard. Using this benchmark, we systematically evaluate agentic frameworks, comparing different data retrieval strategies (direct FHIR API calls vs. specialized tools), interaction patterns (single-turn vs. multi-turn), and reasoning strategies (natural language vs. code generation). Our experiments highlight the practical challenges of retrieving data from intricate FHIR resources and the difficulty of reasoning over them, both of which critically affect question answering performance. We publicly release the FHIR-AgentBench dataset and evaluation suite (https://github.com/glee4810/FHIR-AgentBench) to promote reproducible research and the development of robust, reliable LLM agents for clinical applications.
中文: 本研究推出FHIR-AgentBench基准,利用真实临床数据基于HL7 FHIR标准评估大语言模型代理的数据检索与推理能力,填补现有评估空白,推动临床人工智能应用的稳健发展。
English: The study introduces FHIR-AgentBench, a benchmark using real-world clinical data in the HL7 FHIR standard to evaluate LLM agents' performance in data retrieval and reasoning, addressing gaps in existing assessments and promoting development for clinical AI applications.

Authors:Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Zheng Zhu, Donny Y. Chen, Bohan Zhuang
Title: VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction
Abstract:
Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment's reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over Gaussian density based on 3D scene complexity, yielding more faithful Gaussian point clouds, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks including RealEstate10K and ScanNet demonstrate that VolSplat achieves state-of-the-art performance while producing more plausible and view-consistent Gaussian reconstructions. In addition to superior results, our approach establishes a more scalable framework for feed-forward 3D reconstruction with denser and more robust representations, paving the way for further research in wider communities. The video results, code and trained models are available on our project page: https://lhmd.top/volsplat.
Chinese: VolSplat提出了一种体素对齐的高斯范式,克服了像素对齐方法的局限,在真实感视图合成中实现了最先进的性能,同时提升了几何一致性和重建鲁棒性。
English: VolSplat introduces a voxel-aligned Gaussian paradigm that overcomes the limitations of pixel-aligned methods, achieving state-of-the-art performance in novel view synthesis with improved geometric consistency and more robust 3D reconstructions.

Authors:Gabriel Maldonado, Narges Rashvand, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi
Title: Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps
Abstract:
Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method's superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion's complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at https://github.com/TeCSAR-UNCC/Pose-Quantization.
中文: 本研究提出了一种对抗性优化的VQ-GAN框架,通过密集运动标记化有效压缩人体运动数据,在重建质量和时间稳定性上显著优于基线方法,同时揭示了2D和3D运动表征的最佳词汇量规模。
English: This study presents an adversarially-refined VQ-GAN framework that effectively compresses human motion data using dense tokenization, significantly outperforming baselines in reconstruction quality and temporal stability while revealing optimal vocabulary sizes for 2D and 3D motion representation.

Authors:Chunhao Tian, Yutong Wang, Xuebo Liu, Zhexuan Wang, Liang Ding, Miao Zhang, Min Zhang
Title: AgentInit: Initializing LLM-based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration
Abstract:
Proper initialization is crucial for any system, particularly in multi-agent systems (MAS), where it plays a pivotal role in determining both the system's efficiency and effectiveness. However, existing MAS initialization methods do not fully account for the collaborative needs of the generated agents in subsequent stages. Inspired by the principles of effective team composition, we propose AgentInit, which aims to optimize the structure of agent teams. Specifically, in addition to multi-round interactions and reflections between agents during agent generation, AgentInit incorporates a Natural Language to Format mechanism to ensure consistency and standardization. Balanced team selection strategies using Pareto principles are subsequently applied to jointly consider agent team diversity and task relevance to promote effective and efficient collaboration and enhance overall system performance. Experiments show that AgentInit consistently outperforms state-of-the-art initialization methods and pre-defined strategies across various frameworks and tasks, achieving an overall performance improvement of up to 1.2 and 1.6, respectively, while also significantly reducing token consumption. Further analysis confirms its strong transferability to similar tasks and verifies the effectiveness of its key components, demonstrating its capability and adaptability as a reliable MAS initialization method. Source code and models are available at https://github.com/1737423697/AgentInit.
中文: AgentInit作为一种新型多智能体系统初始化方法,通过结构化交互、标准化格式和均衡选择策略优化团队构建,在各类任务中实现卓越性能与效率提升,同时显著降低资源消耗。
English: AgentInit, a novel initialization method for multi-agent systems, optimizes team composition through structured interactions, standardized formatting, and balanced selection strategies, achieving superior performance and efficiency across various tasks while reducing resource consumption.

Authors:Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, Georgios Tzimiropoulos
Title: Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
Abstract:
Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: https://github.com/IoannaNti/LexiCLIP
This work introduces a vision-free, single-encoder retrieval method that replaces traditional multimodal models with text-to-text retrieval using structured image descriptions, achieving superior performance while reducing computational costs and privacy concerns.
English Summary:

Authors:Yun Wang, Junjie Hu, Junhui Hou, Chenghao Zhang, Renwei Yang, Dapeng Oliver Wu
Title: RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions
Abstract:
Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods. Codes are available at \textcolor{blue}{https://github.com/cocowy1/RoSe-Robust-Self-supervised-Stereo-Matching-under-Adverse-Weather-Conditions}.
中文: 针对现有自监督立体匹配方法在恶劣天气下因特征提取困难和像素对应关系破坏而性能下降的问题,提出融合视觉基础模型先验与场景对应学习的鲁棒训练范式,有效提升了模型在雨雾等复杂环境下的视差估计精度。
English: Recent self-supervised stereo matching methods struggle in adverse weather due to degraded feature extraction and disrupted pixel correspondences, prompting the development of a robust training paradigm that integrates visual foundation model priors and scene correspondence learning to significantly enhance performance.

Authors:Qingfeng Lan, Gautham Vasan, A. Rupam Mahmood
Title: Efficient Reinforcement Learning by Reducing Forgetting with Elephant Activation Functions
Abstract:
Catastrophic forgetting has remained a significant challenge for efficient reinforcement learning for decades (Ring 1994, Rivest and Precup 2003). While recent works have proposed effective methods to mitigate this issue, they mainly focus on the algorithmic side. Meanwhile, we do not fully understand what architectural properties of neural networks lead to catastrophic forgetting. This study aims to fill this gap by studying the role of activation functions in the training dynamics of neural networks and their impact on catastrophic forgetting in reinforcement learning setup. Our study reveals that, besides sparse representations, the gradient sparsity of activation functions also plays an important role in reducing forgetting. Based on this insight, we propose a new class of activation functions, elephant activation functions, that can generate both sparse outputs and sparse gradients. We show that by simply replacing classical activation functions with elephant activation functions in the neural networks of value-based algorithms, we can significantly improve the resilience of neural networks to catastrophic forgetting, thus making reinforcement learning more sample-efficient and memory-efficient.
中文摘要:本研究发现激活函数的梯度稀疏性对减少强化学习中的灾难性遗忘至关重要,并提出新型大象激活函数,通过产生稀疏输出和梯度来显著增强神经网络的抗遗忘能力。
English Summary: This study identifies gradient sparsity in activation functions as crucial for reducing catastrophic forgetting in reinforcement learning and proposes novel elephant activation functions that enhance neural network resilience by producing sparse outputs and gradients.

Authors:Sarvesh Prajapati, Ananya Trivedi, Nathaniel Hanson, Bruce Maxwell, Taskin Padir
Title: Spectral Signature Mapping from RGB Imagery for Terrain-Aware Navigation
Abstract:
Successful navigation in outdoor environments requires accurate prediction of the physical interactions between the robot and the terrain. To this end, several methods rely on geometric or semantic labels to classify traversable surfaces. However, such labels cannot distinguish visually similar surfaces that differ in material properties. Spectral sensors enable inference of material composition from surface reflectance measured across multiple wavelength bands. Although spectral sensing is gaining traction in robotics, widespread deployment remains constrained by the need for custom hardware integration, high sensor costs, and compute-intensive processing pipelines. In this paper, we present RGB Image to Spectral Signature Neural Network (RS-Net), a deep neural network designed to bridge the gap between the accessibility of RGB sensing and the rich material information provided by spectral data. RS-Net predicts spectral signatures from RGB patches, which we map to terrain labels and friction coefficients. The resulting terrain classifications are integrated into a sampling-based motion planner for a wheeled robot operating in outdoor environments. Likewise, the friction estimates are incorporated into a contact-force-based MPC for a quadruped robot navigating slippery surfaces. Thus, we introduce a framework that learns the task-relevant physical property once during training and thereafter relies solely on RGB sensing at test time. The code is available at https://github.com/prajapatisarvesh/RS-Net.
中文摘要:RS-Net是一种深度学习框架,通过从RGB图像预测光谱特征来估算地形属性,使机器人经过初始训练后仅需普通摄像头即可实现户外环境导航。
English summary: RS-Net is a deep learning framework that predicts spectral signatures from RGB images to estimate terrain properties, enabling robots to navigate outdoor environments using only standard cameras after initial training.

Authors:Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe
Title: 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference
Abstract:
Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i
中文: Sa2VA-i是Sa2VA的改进版本,通过修正训练与推理间的不一致性问题,在多个视频分割基准测试中创下最新记录并显著提升性能。
English: Sa2VA-i is an enhanced version of Sa2VA that addresses inconsistencies between training and inference, achieving state-of-the-art results on multiple video segmentation benchmarks with significant performance improvements.

Authors:Teng Xiao, Zuchao Li, Lefei Zhang
Title: OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment
Abstract:
Recent advances in multimodal large language models (LLMs) have led to significant progress in understanding, generation, and retrieval tasks. However, current solutions often treat these tasks in isolation or require training LLMs from scratch, resulting in high computational costs and limited generalization across modalities. In this work, we present OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a unified architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module. To address the challenge of task interference, we propose a two-stage decoupled training strategy: supervised fine-tuning and latent space alignment for aligning LLM behavior with multimodal reasoning, and semantic-guided diffusion training to align cross-modal latent spaces via learnable query embeddings. Extensive experiments across a wide range of benchmarks demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks. Moreover, our results highlight the effectiveness of latent space alignment for unifying multimodal modeling under a shared representation space. Code and models are released at https://github.com/xiao-xt/OmniBridge.
中文:OmniBridge是一个统一的多模态框架,通过以语言为中心的设计和两阶段训练策略,集成了视觉语言理解、生成与检索任务,在多种基准测试中取得优异性能,并突显了潜在空间对齐的有效性。
English: OmniBridge is a unified multimodal framework that integrates vision-language understanding, generation, and retrieval through a language-centric design and a two-stage training strategy, achieving competitive performance across diverse benchmarks while emphasizing the effectiveness of latent space alignment.

Authors:Honghao Chen, Xingzhou Lou, Xiaokun Feng, Kaiqi Huang, Xinlong Wang
Title: Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards
Abstract:
Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggles to perform fine-grained structured reasoning and, more importantly, are difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling assessing reasoning step quality accurately and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at https://github.com/baaivision/CoS.
中文: 本文提出视觉语言模型的逐步推理链方法,通过细粒度奖励评估和强化学习在基准测试中实现稳定提升,同时提供透明框架组件与实证分析启示。
English: This paper introduces chain of step reasoning for vision-language models, enabling fine-grained reward evaluation and reinforcement learning to achieve consistent improvements on benchmarks while providing transparent framework components and empirical insights.

Authors:Lorenzo Shaikewitz, Tim Nguyen, Luca Carlone
Title: Category-Level Object Shape and Pose Estimation in Less Than a Millisecond
Abstract:
Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object's unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario. Code is released at https://github.com/MIT-SPARK/Fast-ShapeAndPose.
Chinese: 本文提出了一种快速局部求解器,用于物体形状和姿态估计,仅需类别级先验知识并提供高效的全局最优性验证,通过自洽场迭代实现每轮约100微秒的快速计算。
English: This paper introduces a fast local solver for object shape and pose estimation that uses category-level priors and provides an efficient global optimality certificate, achieving rapid performance with self-consistent field iteration in about 100 microseconds per iteration.

Authors:Pamela Osuna-Vargas, Altug Kamacioglu, Dominik F. Aschauer, Petros E. Vlachos, Sercan Alipek, Jochen Triesch, Simon Rumpel, Matthias Kaschube
Title: SynapFlow: A Modular Framework Towards Large-Scale Analysis of Dendritic Spines
Abstract:
Dendritic spines are key structural components of excitatory synapses in the brain. Given the size of dendritic spines provides a proxy for synaptic efficacy, their detection and tracking across time is important for studies of the neural basis of learning and memory. Despite their relevance, large-scale analyses of the structural dynamics of dendritic spines in 3D+time microscopy data remain challenging and labor-intense. Here, we present a modular machine learning-based pipeline designed to automate the detection, time-tracking, and feature extraction of dendritic spines in volumes chronically recorded with two-photon microscopy. Our approach tackles the challenges posed by biological data by combining a transformer-based detection module, a depth-tracking component that integrates spatial features, a time-tracking module to associate 3D spines across time by leveraging spatial consistency, and a feature extraction unit that quantifies biologically relevant spine properties. We validate our method on open-source labeled spine data, and on two complementary annotated datasets that we publish alongside this work: one for detection and depth-tracking, and one for time-tracking, which, to the best of our knowledge, is the first data of this kind. To encourage future research, we release our data, code, and pre-trained weights at https://github.com/pamelaosuna/SynapFlow, establishing a baseline for scalable, end-to-end analysis of dendritic spine dynamics.
Chinese: 本研究提出了一种机器学习流程,可自动检测、追踪和分析三维延时显微镜数据中的树突棘,以促进学习和记忆研究。
English: This study introduces a machine learning pipeline that automates the detection, tracking, and feature analysis of dendritic spines in 3D time-lapse microscopy data to facilitate research on learning and memory.

Authors:Gongrui Nan, Siye Chen, Jing Huang, Mengyu Lu, Dexun Wang, Chunmei Xie, Weiqi Xiong, Xianzhou Zeng, Qixuan Zhou, Yadong Li, Xingzhong Xu
Title: NGRPO: Negative-enhanced Group Relative Policy Optimization
Abstract:
RLVR has enhanced the reasoning capabilities of Large Language Models (LLMs) across various tasks. However, GRPO, a representative RLVR algorithm, suffers from a critical limitation: when all responses within a group are either entirely correct or entirely incorrect, the model fails to learn from these homogeneous responses. This is particularly problematic for homogeneously incorrect groups, where GRPO's advantage function yields a value of zero, leading to null gradients and the loss of valuable learning signals. To overcome this issue, we propose NGRPO (Negative-enhanced Group Relative Policy Optimization), an algorithm designed to convert homogeneous errors into robust learning signals. First, NGRPO introduces Advantage Calibration. This mechanism hypothesizes the existence of a virtual maximum-reward sample during advantage calculation, thereby altering the mean and variance of rewards within a group and ensuring that the advantages for homogeneously incorrect samples are no longer zero. Second, NGRPO employs Asymmetric Clipping, which relaxes the update magnitude for positive samples while imposing stricter constraints on that of negative samples. This serves to stabilize the exploration pressure introduced by the advantage calibration. Our experiments on Qwen2.5-Math-7B demonstrate that NGRPO significantly outperforms baselines such as PPO, GRPO, DAPO, and PSR-NSR on mathematical benchmarks including MATH500, AMC23, and AIME2025. These results validate NGRPO's ability to learn from homogeneous errors, leading to stable and substantial improvements in mathematical reasoning. Our code is available at https://github.com/nangongrui-ngr/NGRPO.
Chinese: NGRPO通过引入优势校准和非对称裁剪机制,解决了GRPO算法无法从同质错误中学习的缺陷,在MATH500和AIME2025等数学推理基准上实现了显著性能提升。
English: NGRPO addresses GRPO's limitation of failing to learn from homogeneous incorrect responses by introducing Advantage Calibration and Asymmetric Clipping, significantly improving mathematical reasoning performance in benchmarks like MATH500 and AIME2025.

Authors:Damian Stachura, Joanna Konieczna, Artur Nowak
Title: Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?
Abstract:
Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.
Chinese: 开源大语言模型在生物医学问答中的表现已媲美专有模型,采用集成策略时甚至能超越闭源模型。
English: Open-weight large language models are now performing comparably to proprietary models in biomedical question-answering, sometimes even surpassing them when using ensemble strategies.

Authors:Wenlong Lyu, Yuheng Jia, Hui Liu, Junhui Hou
Title: Graph-based Clustering Revisited: A Relaxation of Kernel $k$-Means Perspective
Abstract:
The well-known graph-based clustering methods, including spectral clustering, symmetric non-negative matrix factorization, and doubly stochastic normalization, can be viewed as relaxations of the kernel $k$-means approach. However, we posit that these methods excessively relax their inherent low-rank, nonnegative, doubly stochastic, and orthonormal constraints to ensure numerical feasibility, potentially limiting their clustering efficacy. In this paper, guided by our theoretical analyses, we propose \textbf{Lo}w-\textbf{R}ank \textbf{D}oubly stochastic clustering (\textbf{LoRD}), a model that only relaxes the orthonormal constraint to derive a probabilistic clustering results. Furthermore, we theoretically establish the equivalence between orthogonality and block diagonality under the doubly stochastic constraint. By integrating \textbf{B}lock diagonal regularization into LoRD, expressed as the maximization of the Frobenius norm, we propose \textbf{B-LoRD}, which further enhances the clustering performance. To ensure numerical solvability, we transform the non-convex doubly stochastic constraint into a linear convex constraint through the introduction of a class probability parameter. We further theoretically demonstrate the gradient Lipschitz continuity of our LoRD and B-LoRD enables the proposal of a globally convergent projected gradient descent algorithm for their optimization. Extensive experiments validate the effectiveness of our approaches. The code is publicly available at https://github.com/lwl-learning/LoRD.
中文摘要:本文提出的LoRD和B-LoRD聚类方法通过策略性地放松约束条件,在保持理论保证和数值可解性的同时有效提升了聚类性能。
English Summary: The paper introduces LoRD and B-LoRD clustering methods that strategically relax constraints to improve clustering performance while maintaining theoretical guarantees and numerical solvability.

Authors:Liting Zhang, Shiwan Zhao, Aobo Kong, Qicheng Li
Title: MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction
Abstract:
Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt-based methods for Large Language Models (LLMs) often rely on single-stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one-size-fits-all designs hinder the full exploitation of LLMs' reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi-agent collaboration into keyphrase extraction. MAPEX coordinates LLM-based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. A dual-path strategy dynamically adapts to document length: knowledge-driven extraction for short texts and topic-guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate its strong generalization and universality, outperforming the state-of-the-art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at https://github.com/NKU-LITI/MAPEX.
中文: MAPEX首次将多智能体协作引入关键词提取,通过双路径策略动态适应文档长度,在F1@5指标上平均优于当前最优方法2.44%。
English: MAPEX introduces a multi-agent collaboration framework for keyphrase extraction, dynamically adapting to document length through dual-path strategies and outperforming state-of-the-art methods by 2.44% in F1@5 on average.

Authors:Christian Ganhör, Marta Moscati, Anna Hausberger, Shah Nawaz, Markus Schedl
Title: Single-Branch Network Architectures to Close the Modality Gap in Multimodal Recommendation
Abstract:
Traditional recommender systems rely on collaborative filtering, using past user-item interactions to help users discover new items in a vast collection. In cold start, i.e., when interaction histories of users or items are not available, content-based recommender systems use side information instead. Hybrid recommender systems (HRSs) often employ multimodal learning to combine collaborative and side information, which we jointly refer to as modalities. Though HRSs can provide recommendations when some modalities are missing, their quality degrades. In this work, we utilize single-branch neural networks equipped with weight sharing, modality sampling, and contrastive loss to provide accurate recommendations even in missing modality scenarios by narrowing the modality gap. We compare these networks with multi-branch alternatives and conduct extensive experiments on three datasets. Six accuracy-based and four beyond-accuracy-based metrics help assess the recommendation quality for the different training paradigms and their hyperparameters in warm-start and missing modality scenarios. We quantitatively and qualitatively study the effects of these different aspects on bridging the modality gap. Our results show that single-branch networks achieve competitive performance in warm-start scenarios and are significantly better in missing modality settings. Moreover, our approach leads to closer proximity of an item's modalities in the embedding space. Our full experimental setup is available at https://github.com/hcai-mms/single-branch-networks.
中文: 本研究采用具有权重共享、模态采样和对比损失的单分支神经网络来改进混合推荐系统,通过缩小模态差距,在热启动场景中表现优异,在模态缺失情况下效果更显著。
English: This study introduces single-branch neural networks with weight sharing, modality sampling, and contrastive loss to enhance hybrid recommender systems, achieving competitive performance in warm-start scenarios and superior results in missing modality cases by reducing the modality gap.

Authors:Kuang Xiaodong, Li Bingxuan, Li Yuan, Rao Fan, Ma Gege, Xie Qingguo, Mok Greta S P, Liu Huafeng, Zhu Wentao
Title: A Kernel Space-based Multidimensional Sparse Model for Dynamic PET Image Denoising
Abstract:
Achieving high image quality for temporal frames in dynamic positron emission tomography (PET) is challenging due to the limited statistic especially for the short frames. Recent studies have shown that deep learning (DL) is useful in a wide range of medical image denoising tasks. In this paper, we propose a model-based neural network for dynamic PET image denoising. The inter-frame spatial correlation and intra-frame structural consistency in dynamic PET are used to establish the kernel space-based multidimensional sparse (KMDS) model. We then substitute the inherent forms of the parameter estimation with neural networks to enable adaptive parameters optimization, forming the end-to-end neural KMDS-Net. Extensive experimental results from simulated and real data demonstrate that the neural KMDS-Net exhibits strong denoising performance for dynamic PET, outperforming previous baseline methods. The proposed method may be used to effectively achieve high temporal and spatial resolution for dynamic PET. Our source code is available at https://github.com/Kuangxd/Neural-KMDS-Net/tree/main.
中文: 本文提出了一种基于模型的神经网络KMDS-Net,利用动态PET图像的帧间空间相关性和帧内结构一致性进行去噪处理,在仿真和真实数据实验中均展现出优于现有方法的性能。
English: This paper introduces a neural KMDS-Net, a model-based deep learning approach that leverages inter-frame spatial correlations and intra-frame structural consistency to effectively denoise dynamic PET images, demonstrating superior performance over existing methods in both simulated and real data.

Authors:Lukas Zanger, Bastian Lampe, Lennart Reiher, Lutz Eckstein
Title: Application Management in C-ITS: Orchestrating Demand-Driven Deployments and Reconfigurations
Abstract:
Vehicles are becoming increasingly automated and interconnected, enabling the formation of cooperative intelligent transport systems (C-ITS) and the use of offboard services. As a result, cloud-native techniques, such as microservices and container orchestration, play an increasingly important role in their operation. However, orchestrating applications in a large-scale C-ITS poses unique challenges due to the dynamic nature of the environment and the need for efficient resource utilization. In this paper, we present a demand-driven application management approach that leverages cloud-native techniques - specifically Kubernetes - to address these challenges. Taking into account the demands originating from different entities within the C-ITS, the approach enables the automation of processes, such as deployment, reconfiguration, update, upgrade, and scaling of microservices. Executing these processes on demand can, for example, reduce computing resource consumption and network traffic. A demand may include a request for provisioning an external supporting service, such as a collective environment model. The approach handles changing and new demands by dynamically reconciling them through our proposed application management framework built on Kubernetes and the Robot Operating System (ROS 2). We demonstrate the operation of our framework in the C-ITS use case of collective environment perception and make the source code of the prototypical framework publicly available at https://github.com/ika-rwth-aachen/application_manager .
中文: 本文提出了一种基于Kubernetes等云原生技术的需求驱动应用管理框架,通过动态协调车辆协同智能运输系统中的微服务部署与资源配置,实现自动化运维并提升资源利用效率。
English: This paper introduces a demand-driven application management framework using cloud-native technologies like Kubernetes to automate and optimize microservice orchestration in dynamic cooperative intelligent transport systems, enhancing resource efficiency and adaptability.

Authors:Antoine P. Leeman, Johannes Köhler, Melanie N. Zeilinger
Title: Guaranteed Robust Nonlinear MPC via Disturbance Feedback
Abstract:
Robots must satisfy safety-critical state and input constraints despite disturbances and model mismatch. We introduce a robust model predictive control (RMPC) formulation that is fast, scalable, and compatible with real-time implementation. Our formulation guarantees robust constraint satisfaction, input-to-state stability (ISS) and recursive feasibility. The key idea is to decompose the uncertain nonlinear system into (i) a nominal nonlinear dynamic model, (ii) disturbance-feedback controllers, and (iii) bounds on the model error. These components are optimized jointly using sequential convex programming. The resulting convex subproblems are solved efficiently using a recent disturbance-feedback MPC solver. The approach is validated across multiple dynamics, including a rocket-landing problem with steerable thrust. An open-source implementation is available at https://github.com/antoineleeman/robust-nonlinear-mpc.
中文: 本文提出了一种快速且可扩展的鲁棒模型预测控制方法,通过分解系统不确定性和采用序列凸优化,确保非线性系统在扰动下的安全性与稳定性。
English: This paper presents a fast and scalable robust model predictive control method that ensures safety and stability for nonlinear systems under disturbances by decomposing uncertainty and using sequential convex programming.

Authors:Ruichao Hou, Xingyuan Li, Tongwei Ren, Dongming Zhou, Gangshan Wu, Jinde Cao
Title: HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection
Abstract:
RGB-thermal salient object detection (RGB-T SOD) aims to identify prominent objects by integrating complementary information from RGB and thermal modalities. However, learning the precise boundaries and complete objects remains challenging due to the intrinsic insufficient feature fusion and the extrinsic limitations of data scarcity. In this paper, we propose a novel hybrid prompt-driven segment anything model (HyPSAM), which leverages the zero-shot generalization capabilities of the segment anything model (SAM) for RGB-T SOD. Specifically, we first propose a dynamic fusion network (DFNet) that generates high-quality initial saliency maps as visual prompts. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction, overcoming the limitations of fixed-parameter kernels and enhancing multi-modal feature representation. Moreover, we propose a plug-and-play refinement network (P2RNet), which serves as a general optimization strategy to guide SAM in refining saliency maps by using hybrid prompts. The text prompt ensures reliable modality input, while the mask and box prompts enable precise salient object localization. Extensive experiments on three public datasets demonstrate that our method achieves state-of-the-art performance. Notably, HyPSAM has remarkable versatility, seamlessly integrating with different RGB-T SOD methods to achieve significant performance gains, thereby highlighting the potential of prompt engineering in this field. The code and results of our method are available at: https://github.com/milotic233/HyPSAM.
中文摘要:本文提出HyPSAM模型,通过动态融合网络生成视觉提示并利用混合提示优化策略,结合SAM的零样本泛化能力提升RGB-热成像显著目标检测性能,在多个公开数据集上达到最优效果。
English Summary: This paper introduces HyPSAM, a hybrid prompt-driven model that enhances RGB-thermal salient object detection by leveraging SAM's zero-shot capabilities through dynamic fusion and refinement networks, achieving state-of-the-art performance across multiple datasets.

Authors:Yingquan Wang, Pingping Zhang, Chong Sun, Dong Wang, Huchuan Lu
Title: What Makes You Unique? Attribute Prompt Composition for Object Re-Identification
Abstract:
Object Re-IDentification (ReID) aims to recognize individuals across non-overlapping camera views. While recent advances have achieved remarkable progress, most existing models are constrained to either single-domain or cross-domain scenarios, limiting their real-world applicability. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies that may inadvertently suppress identity-specific discriminative cues. To address these limitations, we propose an Attribute Prompt Composition (APC) framework, which exploits textual semantics to jointly enhance discrimination and generalization. Specifically, we design an Attribute Prompt Generator (APG) consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM). SAD is an over-complete attribute dictionary to provide rich semantic descriptions, while PCM adaptively composes relevant attributes from SAD to generate discriminative attribute-aware features. In addition, motivated by the strong generalization ability of Vision-Language Models (VLM), we propose a Fast-Slow Training Strategy (FSTS) to balance ReID-specific discrimination and generalizable representation learning. Specifically, FSTS adopts a Fast Update Stream (FUS) to rapidly acquire ReID-specific discriminative knowledge and a Slow Update Stream (SUS) to retain the generalizable knowledge inherited from the pre-trained VLM. Through a mutual interaction, the framework effectively focuses on ReID-relevant features while mitigating overfitting. Extensive experiments on both conventional and Domain Generalized (DG) ReID datasets demonstrate that our framework surpasses state-of-the-art methods, exhibiting superior performances in terms of both discrimination and generalization. The source code is available at https://github.com/AWangYQ/APC.
中文摘要:提出的属性提示组合(APC)框架利用文本语义和双流训练策略,在目标重识别任务中同时提升判别性与泛化能力,在多个数据集上超越了现有最优方法。
English Summary: The proposed Attribute Prompt Composition (APC) framework leverages textual semantics and a dual-stream training strategy to enhance both discrimination and generalization in object re-identification, outperforming existing methods across diverse datasets.

Authors:Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, Jian Kang
Title: Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction
Abstract:
LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluation remains underexplored. This lack of reliability may limit its deployment in many applications. This work presents the first framework to analyze the uncertainty by offering a prediction interval of LLM-based scoring via conformal prediction. Conformal prediction constructs continuous prediction intervals from a single evaluation run, and we design an ordinal boundary adjustment for discrete rating tasks. We also suggest a midpoint-based score within the interval as a low-bias alternative to raw model score and weighted average. We perform extensive experiments and analysis, which show that conformal prediction can provide valid prediction interval with coverage guarantees. We also explore the usefulness of interval midpoint and judge reprompting for better judgment.
中文摘要:本研究首次提出一个基于保形预测的框架,用于量化大语言模型评估自然语言生成的不确定性,通过构建预测区间并提供覆盖率保证,同时提出区间中点评分法作为更可靠的评估替代方案。
English Summary: This study introduces a conformal prediction framework to quantify the uncertainty in LLM-based evaluation of natural language generation, providing prediction intervals with coverage guarantees and proposing a midpoint scoring method as a reliable alternative.

Authors:Nicolas Toussaint, Emanuele Colleoni, Ricardo Sanchez-Matilla, Joshua Sutcliffe, Vanessa Thompson, Muhammad Asad, Imanol Luengo, Danail Stoyanov
Title: Zero-shot Monocular Metric Depth for Endoscopic Images
Abstract:
Monocular relative and metric depth estimation has seen a tremendous boost in the last few years due to the sharp advancements in foundation models and in particular transformer based networks. As we start to see applications to the domain of endoscopic images, there is still a lack of robust benchmarks and high-quality datasets in that area. This paper addresses these limitations by presenting a comprehensive benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images, providing critical insights into their generalisation and performance in clinical scenarios. Additionally, we introduce and publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks, designed to bridge the gap between synthetic and real-world data. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin. By providing both a benchmark and a synthetic dataset, this work advances the field of depth estimation for endoscopic images and serves as an important resource for future research. Project page, EndoSynth dataset and trained weights are available at https://github.com/TouchSurgery/EndoSynth.
中文: 本文提出了一个全面的基准测试和新型合成数据集EndoSynth,旨在提升内窥镜图像的深度估计效果,通过微调显著提高模型精度,并推动临床应用的发展。
English: This paper introduces a comprehensive benchmark and a novel synthetic dataset, EndoSynth, to enhance depth estimation in endoscopic images, significantly improving model accuracy through fine-tuning and advancing clinical applications.

Authors:Yuanhuiyi Lyu, Chi Kit Wong, Chenfei Liao, Lutao Jiang, Xu Zheng, Zexin Lu, Linfeng Zhang, Xuming Hu
Title: Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation
Abstract:
Recent works have made notable advancements in enhancing unified models for text-to-image generation through the Chain-of-Thought (CoT). However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of our UiG is to integrate generative guidance by the strong understanding capabilities during the reasoning process, thereby mitigating the limitations of generative abilities. To achieve this, we introduce "Image Editing" as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. The project code: https://github.com/QC-LY/UiG
中文摘要:本研究提出的理解生成一体化(UiG)框架通过图像编辑将统一模型的强大理解能力融入生成过程,有效提升了文本到图像的生成性能,并在基准测试中显著优于现有方法。
English Summary: The proposed Understanding-in-Generation (UiG) framework integrates image editing to leverage unified models' strong comprehension capabilities, thereby enhancing text-to-image generation performance and achieving significant improvements over existing methods.

Authors:Yara Mohajerani
Title: Adaptive Learning in Spatial Agent-Based Models for Climate Risk Assessment: A Geospatial Framework with Evolutionary Economic Agents
Abstract:
Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems. We present a novel geospatial agent-based model that integrates climate hazard data with evolutionary learning for economic agents. Our framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment, introducing adaptive learning behaviours that allow firms to evolve strategies for budget allocation, pricing, wages, and risk adaptation through fitness-based selection and mutation. We demonstrate the framework using riverine flood projections under RCP8.5 until 2100, showing that evolutionary adaptation enables firms to converge with baseline (no hazard) production levels after decades of disruption due to climate stress. Our results reveal systemic risks where even agents that are not directly exposed to floods face impacts through supply chain disruptions, with the end-of-century average price of goods 5.6% higher under RCP8.5 compared to the baseline. This open-source framework provides financial institutions and companies with tools to quantify both direct and cascading climate risks while evaluating cost-effective adaptation strategies.
中文摘要:本研究提出了一种结合气候灾害与自适应经济行为的地理空间代理模型,通过洪水预测表明进化学习能使企业在气候干扰后恢复生产水平,同时揭示了导致商品价格显著上涨的系统性供应链风险。
English Summary: This study introduces a geospatial agent-based model integrating climate hazards with adaptive economic behaviors, demonstrating through flood projections that evolutionary learning enables firms to recover production levels despite climate disruptions while revealing systemic supply chain risks causing significant price increases.

Authors:Parsa Vahidi, Omid G. Sani, Maryam M. Shanechi
Title: BRAID: Input-Driven Nonlinear Dynamical Modeling of Neural-Behavioral Data
Abstract:
Neural populations exhibit complex recurrent structures that drive behavior, while continuously receiving and integrating external inputs from sensory stimuli, upstream regions, and neurostimulation. However, neural populations are often modeled as autonomous dynamical systems, with little consideration given to the influence of external inputs that shape the population activity and behavioral outcomes. Here, we introduce BRAID, a deep learning framework that models nonlinear neural dynamics underlying behavior while explicitly incorporating any measured external inputs. Our method disentangles intrinsic recurrent neural population dynamics from the effects of inputs by including a forecasting objective within input-driven recurrent neural networks. BRAID further prioritizes the learning of intrinsic dynamics that are related to a behavior of interest by using a multi-stage optimization scheme. We validate BRAID with nonlinear simulations, showing that it can accurately learn the intrinsic dynamics shared between neural and behavioral modalities. We then apply BRAID to motor cortical activity recorded during a motor task and demonstrate that our method more accurately fits the neural-behavioral data by incorporating measured sensory stimuli into the model and improves the forecasting of neural-behavioral data compared with various baseline methods, whether input-driven or not.
Chinese: BRAID是一种深度学习框架,通过整合外部输入并分离内在循环动态与输入影响来模拟神经动态,从而提高了神经行为数据预测和拟合的准确性。
English: BRAID is a deep learning framework that models neural dynamics by incorporating external inputs and disentangling intrinsic recurrent dynamics from input effects, improving the accuracy of neural-behavioral data forecasting and fitting.

Authors:Jiarui Jin, Haoyu Wang, Xiang Lan, Jun Li, Gaofeng Cheng, Hongyan Li, Shenda Hong
Title: UniECG: Understanding and Generating ECG in One Unified Model
Abstract:
Recent unified models such as GPT-5 have achieved encouraging progress on vision-language tasks. However, these unified models typically fail to correctly understand ECG signals and provide accurate medical diagnoses, nor can they correctly generate ECG signals. To address these limitations, we propose UniECG, the first unified model for ECG capable of concurrently performing evidence-based ECG interpretation and text-conditioned ECG generation tasks. Through a decoupled two-stage training approach, the model first learns evidence-based interpretation skills (ECG-to-Text), and then injects ECG generation capabilities (Text-to-ECG) via latent space alignment. UniECG can autonomously choose to interpret or generate an ECG based on user input, significantly extending the capability boundaries of current ECG models. Our code and checkpoints will be made publicly available at https://github.com/PKUDigitalHealth/UniECG upon acceptance.
中文:UniECG是首个能够通过解耦两阶段训练方法同时实现基于证据的心电图解读和文本条件心电图生成任务的统一模型,有效解决了现有模型在心电图理解和生成方面的不足。
English: UniECG is the first unified model that enables both evidence-based ECG interpretation and text-conditioned ECG generation through a decoupled two-stage training approach, overcoming the limitations of current models in understanding and generating ECG signals.

Authors:Yu Chen, Yifei Han, Long Zhang, Yue Du, Bin Li
Title: TsqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-Tuning
Abstract:
Fine-tuning large pre-trained models for downstream tasks has become a fundamental approach in natural language processing. Fully fine-tuning all model parameters is computationally expensive and memory-intensive, especially in resource-constrained environments. Existing parameter-efficient fine-tuning methods reduce the number of trainable parameters but typically overlook the varying sensitivity of different model layers and the importance of training data. In this work, we propose TsqLoRA, a novel method that integrates data-quality-driven selection with sensitivity-aware low-rank adaptation, consisted of two main components: a quality-aware sampling mechanism for selecting the most informative training data, and a dynamic rank allocation module that adjusts the rank of each layer based on its sensitivity to parameter updates. The experimental results demonstrate that TsqLoRA improves fine-tuning efficiency while maintaining or even improving performance on a variety of NLP tasks. Our code will be available at https://github.com/Benjamin-Ricky/TsqLoRA.
中文:TsqLoRA是一种创新的参数高效微调方法,通过质量感知数据选择和基于敏感性的动态秩分配,在保持或提升多种NLP任务性能的同时显著提高了微调效率。
English: TsqLoRA is a novel parameter-efficient fine-tuning method that combines quality-aware data selection with sensitivity-based dynamic rank allocation to enhance efficiency while maintaining or improving performance across NLP tasks.

Authors:Yaoyao Qian, Yifan Zeng, Yuchao Jiang, Chelsi Jain, Huazheng Wang
Title: The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking
Abstract:
Large Language Models (LLMs) have demonstrated strong performance in information retrieval tasks like passage ranking. Our research examines how instruction-following capabilities in LLMs interact with multi-document comparison tasks, identifying what we term the "Ranking Blind Spot", a characteristic of LLM decision processes during comparative evaluation. We analyze how this ranking blind spot affects LLM evaluation systems through two approaches: Decision Objective Hijacking, which alters the evaluation goal in pairwise ranking systems, and Decision Criteria Hijacking, which modifies relevance standards across ranking schemes. These approaches demonstrate how content providers could potentially influence LLM-based ranking systems to affect document positioning. These attacks aim to force the LLM ranker to prefer a specific passage and rank it at the top. Malicious content providers can exploit this weakness, which helps them gain additional exposure by attacking the ranker. In our experiment, We empirically show that the proposed attacks are effective in various LLMs and can be generalized to multiple ranking schemes. We apply these attack to realistic examples to show their effectiveness. We also found stronger LLMs are more vulnerable to these attacks. Our code is available at: https://github.com/blindspotorg/RankingBlindSpot
中文: 大语言模型存在"排序盲点"漏洞,其比较评估过程会通过决策目标劫持和决策标准劫持被恶意内容提供者操纵,从而人为提升文档排名,实验表明性能更强的模型反而更容易受到此类攻击。
English: Large Language Models exhibit a "Ranking Blind Spot" vulnerability where their comparative evaluation processes can be manipulated through Decision Objective and Criteria Hijacking, allowing malicious content providers to artificially boost document rankings, with experiments showing stronger LLMs are paradoxically more susceptible to these attacks.

Authors:Jiaxun Yang, Yifei Han, Long Zhang, Yujie Liu, Bin Li, Bo Gao, Yangfan He, Kejia Zhan
Title: CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection
Abstract:
Chinese Patronizing and Condescending Language (CPCL) is an implicitly discriminatory toxic speech targeting vulnerable groups on Chinese video platforms. The existing dataset lacks user comments, which are a direct reflection of video content. This undermines the model's understanding of video content and results in the failure to detect some CPLC videos. To make up for this loss, this research reconstructs a new dataset PCLMMPLUS that includes 103k comment entries and expands the dataset size. We also propose the CPCLDetector model with alignment selection and knowledge-enhanced comment content modules. Extensive experiments show the proposed CPCLDetector outperforms the SOTA on PCLMM and achieves higher performance on PCLMMPLUS . CPLC videos are detected more accurately, supporting content governance and protecting vulnerable groups. Code and dataset are available at https://github.com/jiaxunyang256/PCLD.
中文摘要:本研究通过构建包含用户评论的扩展数据集并开发CPCLDetector模型,弥补了中文视频平台上针对弱势群体的施舍性语言检测的不足,提高了识别准确率以加强内容治理。
English Summary: This study addresses the gap in detecting Chinese Patronizing and Condescending Language (CPCL) by creating an expanded dataset with user comments and developing the CPCLDetector model, which improves detection accuracy to better protect vulnerable groups on video platforms.

Authors:Zixin Zhu, Haoxiang Li, Xuelu Feng, He Wu, Chunming Qiao, Junsong Yuan
Title: GeoRemover: Removing Objects and Their Causal Visual Artifacts
Abstract:
Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object's geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at https://github.com/buxiangzhiren/GeoRemover.
This paper introduces a geometry-aware two-stage framework that first removes objects from geometric data using strict mask alignment and then renders photorealistic images, effectively eliminating both target objects and their causal visual artifacts while maintaining structural integrity.
English Summary:

Authors:Zixin Zhu, Haoxiang Li, Xuelu Feng, He Wu, Chunming Qiao, Junsong Yuan
Title: GeoRemover: Removing Objects and Their Causal Visual Artifacts
Abstract:
Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object's geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at https://github.com/buxiangzhiren/GeoRemover.
This paper introduces a geometry-aware two-stage framework that first removes objects from geometric data using strict mask alignment and then renders photorealistic images, effectively eliminating both target objects and their causal visual artifacts while maintaining structural integrity.
English Summary:

Authors:Jin Young Kim, Ji Won Yoon
Title: CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs
Abstract:
Recently, inference-time reasoning strategies have further improved the accuracy of large language models (LLMs), but their effectiveness on smaller models remains unclear. Based on the observation that conventional approaches often fail to improve performance in this context, we propose \textbf{C}ycle-\textbf{C}onsistency in \textbf{Q}uestion \textbf{A}nswering (CCQA), a novel reasoning method that can be effectively applied to SLMs. Inspired by cycle consistency, CCQA generates a question from each reasoning path and answer, evaluates each by its similarity to the original question, and then selects the candidate solution with the highest similarity score as the final response. Since conventional SLMs struggle to generate accurate questions from their own reasoning paths and answers, we employ a lightweight Flan-T5 model specialized for question generation to support this process efficiently. From the experimental results, it is verified that CCQA consistently outperforms existing state-of-the-art (SOTA) methods across eight models on mathematical and commonsense reasoning benchmarks. Furthermore, our method establishes a new practical baseline for efficient reasoning in SLMs. Source code can be found at https://github.com/scai-research/ccqa_official.
中文: 提出的CCQA方法通过利用循环一致性从推理路径生成并评估问题,有效提升了小型语言模型的推理能力,在多个基准测试中持续优于现有方法,并为高效推理设定了新的实用基准。
English: The proposed CCQA method enhances reasoning in smaller language models by generating and evaluating questions from reasoning paths using cycle consistency, consistently outperforming existing methods across benchmarks and establishing a new baseline for efficient reasoning.

Authors:Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, Kangrui Du, Jialian Wu, Ximeng Sun, Jiang Liu, Qiaolin Yu, Hao Chen, Zicheng Liu, Emad Barsoum
Title: APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation
Abstract:
Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community's growing RL needs, numerous RL frameworks have been proposed. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by at most 44% across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves at most 8% higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems. Our codebase is available at https://github.com/RLsys-Foundation/APRIL
中文摘要:APRIL方法通过动态管理强化学习中的rollout生成过程,有效缓解长尾响应分布导致的GPU闲置问题,在多种任务和框架中显著提升了训练效率和最终精度。
English Summary: The proposed APRIL method enhances reinforcement learning efficiency by dynamically managing rollout generation to reduce GPU idle time caused by long-tail response distributions, achieving significant improvements in throughput and accuracy across various tasks and frameworks.

Authors:Yuzhen Zhou, Jiajun Li, Yusheng Su, Gowtham Ramesh, Zilin Zhu, Xiang Long, Chenyang Zhao, Jin Pan, Xiaodong Yu, Ze Wang, Kangrui Du, Jialian Wu, Ximeng Sun, Jiang Liu, Qiaolin Yu, Hao Chen, Zicheng Liu, Emad Barsoum
Title: APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation
Abstract:
Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community's growing RL needs, numerous RL frameworks have been proposed. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by 22.5% on average (at most 44%) across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves 2.1% on average(at most 8%) higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems. Our codebase is available at https://github.com/RLsys-Foundation/APRIL
中文摘要:APRIL方法通过动态管理强化学习中的rollout生成过程,有效缓解长尾响应分布导致的GPU闲置问题,在多种任务和框架中显著提升了训练效率和最终精度。
English Summary: The proposed APRIL method enhances reinforcement learning efficiency by dynamically managing rollout generation to reduce GPU idle time caused by long-tail response distributions, achieving significant improvements in throughput and accuracy across various tasks and frameworks.

Authors:Mohammad Hosseini, Maryam M. Shanechi
Title: Dynamical Modeling of Behaviorally Relevant Spatiotemporal Patterns in Neural Imaging Data
Abstract:
High-dimensional imaging of neural activity, such as widefield calcium and functional ultrasound imaging, provide a rich source of information for understanding the relationship between brain activity and behavior. Accurately modeling neural dynamics in these modalities is crucial for understanding this relationship but is hindered by the high-dimensionality, complex spatiotemporal dependencies, and prevalent behaviorally irrelevant dynamics in these modalities. Existing dynamical models often employ preprocessing steps to obtain low-dimensional representations from neural image modalities. However, this process can discard behaviorally relevant information and miss spatiotemporal structure. We propose SBIND, a novel data-driven deep learning framework to model spatiotemporal dependencies in neural images and disentangle their behaviorally relevant dynamics from other neural dynamics. We validate SBIND on widefield imaging datasets, and show its extension to functional ultrasound imaging, a recent modality whose dynamical modeling has largely remained unexplored. We find that our model effectively identifies both local and long-range spatial dependencies across the brain while also dissociating behaviorally relevant neural dynamics. Doing so, SBIND outperforms existing models in neural-behavioral prediction. Overall, SBIND provides a versatile tool for investigating the neural mechanisms underlying behavior using imaging modalities.
中文: SBIND是一种新型深度学习框架,能有效建模神经影像数据的时空依赖性并分离行为相关动态,在宽场和功能超声成像等模态的神经行为预测中优于现有模型。
English: SBIND is a novel deep learning framework that effectively models spatiotemporal dependencies in neural imaging data to disentangle behaviorally relevant dynamics, outperforming existing models in neural-behavioral prediction across modalities like widefield and functional ultrasound imaging.

Authors:Md Mostafijur Rahman, Radu Marculescu
Title: MK-UNet: Multi-kernel Lightweight CNN for Medical Image Segmentation
Abstract:
In this paper, we introduce MK-UNet, a paradigm shift towards ultra-lightweight, multi-kernel U-shaped CNNs tailored for medical image segmentation. Central to MK-UNet is the multi-kernel depth-wise convolution block (MKDC) we design to adeptly process images through multiple kernels, while capturing complex multi-resolution spatial relationships. MK-UNet also emphasizes the images salient features through sophisticated attention mechanisms, including channel, spatial, and grouped gated attention. Our MK-UNet network, with a modest computational footprint of only 0.316M parameters and 0.314G FLOPs, represents not only a remarkably lightweight, but also significantly improved segmentation solution that provides higher accuracy over state-of-the-art (SOTA) methods across six binary medical imaging benchmarks. Specifically, MK-UNet outperforms TransUNet in DICE score with nearly 333$\times$ and 123$\times$ fewer parameters and FLOPs, respectively. Similarly, when compared against UNeXt, MK-UNet exhibits superior segmentation performance, improving the DICE score up to 6.7% margins while operating with 4.7$\times$ fewer #Params. Our MK-UNet also outperforms other recent lightweight networks, such as MedT, CMUNeXt, EGE-UNet, and Rolling-UNet, with much lower computational resources. This leap in performance, coupled with drastic computational gains, positions MK-UNet as an unparalleled solution for real-time, high-fidelity medical diagnostics in resource-limited settings, such as point-of-care devices. Our implementation is available at https://github.com/SLDGroup/MK-UNet.
中文: MK-UNet提出了一种超轻量级多核U形CNN用于医学图像分割,在参数和计算量大幅减少的情况下,相比现有最优方法实现了更高的精度,非常适合资源受限的实时诊断场景。
English: MK-UNet introduces an ultra-lightweight multi-kernel U-shaped CNN for medical image segmentation, achieving higher accuracy with significantly fewer parameters and computational costs than state-of-the-art methods, making it ideal for resource-limited real-time diagnostics.

Authors:Binhua Huang, Wendong Yao, Shaowu Chen, Guoxin Wang, Qingyuan Wang, Soumyabrata Dev
Title: MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition
Abstract:
We introduce MoCrop, a motion-aware adaptive cropping module for efficient video action recognition in the compressed domain. MoCrop uses motion vectors that are available in H.264 video to locate motion-dense regions and produces a single clip-level crop that is applied to all I-frames at inference. The module is training free, adds no parameters, and can be plugged into diverse backbones. A lightweight pipeline that includes denoising & merge (DM), Monte Carlo sampling (MCS), and adaptive cropping (AC) via a motion-density submatrix search yields robust crops with negligible overhead. On UCF101, MoCrop improves accuracy or reduces compute. With ResNet-50, it delivers +3.5% Top-1 accuracy at equal FLOPs (attention setting), or +2.4% Top-1 accuracy with 26.5% fewer FLOPs (efficiency setting). Applied to CoViAR, it reaches 89.2% Top-1 accuracy at the original cost and 88.5% Top-1 accuracy while reducing compute from 11.6 to 8.5 GFLOPs. Consistent gains on MobileNet-V3, EfficientNet-B1, and Swin-B indicate strong generality and make MoCrop practical for real-time deployment in the compressed domain. Our code and models are available at https://github.com/microa/MoCrop.
MoCrop是一种无需训练、基于运动感知的自适应裁剪模块,通过利用压缩视频中的运动矢量聚焦关键区域,提升视频动作识别的效率,在不同模型上均能提高精度并降低计算开销。
MoCrop is a training-free, motion-aware adaptive cropping module that enhances video action recognition efficiency by using motion vectors from compressed videos to focus on key regions, improving accuracy and reducing computational costs across various models.

Authors:Han-Lin Hsieh, Maryam M. Shanechi
Title: Probabilistic Geometric Principal Component Analysis with application to neural data
Abstract:
Dimensionality reduction is critical across various domains of science including neuroscience. Probabilistic Principal Component Analysis (PPCA) is a prominent dimensionality reduction method that provides a probabilistic approach unlike the deterministic approach of PCA and serves as a connection between PCA and Factor Analysis (FA). Despite their power, PPCA and its extensions are mainly based on linear models and can only describe the data in a Euclidean coordinate system. However, in many neuroscience applications, data may be distributed around a nonlinear geometry (i.e., manifold) rather than lying in the Euclidean space. We develop Probabilistic Geometric Principal Component Analysis (PGPCA) for such datasets as a new dimensionality reduction algorithm that can explicitly incorporate knowledge about a given nonlinear manifold that is first fitted from these data. Further, we show how in addition to the Euclidean coordinate system, a geometric coordinate system can be derived for the manifold to capture the deviations of data from the manifold and noise. We also derive a data-driven EM algorithm for learning the PGPCA model parameters. As such, PGPCA generalizes PPCA to better describe data distributions by incorporating a nonlinear manifold geometry. In simulations and brain data analyses, we show that PGPCA can effectively model the data distribution around various given manifolds and outperforms PPCA for such data. Moreover, PGPCA provides the capability to test whether the new geometric coordinate system better describes the data than the Euclidean one. Finally, PGPCA can perform dimensionality reduction and learn the data distribution both around and on the manifold. These capabilities make PGPCA valuable for enhancing the efficacy of dimensionality reduction for analysis of high-dimensional data that exhibit noise and are distributed around a nonlinear manifold.
Chinese: 我们提出了概率几何主成分分析(PGPCA),这是一种新的降维方法,通过引入非线性流形几何扩展了概率主成分分析,能够更有效地建模分布在弯曲曲面周围的数据,在仿真和脑数据分析中均优于传统方法。
English: We introduce Probabilistic Geometric Principal Component Analysis (PGPCA), a novel dimensionality reduction method that extends Probabilistic PCA by incorporating nonlinear manifold geometry, enabling more effective modeling of data distributed around curved surfaces and outperforming traditional approaches in both simulations and brain data analyses.

Authors:Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud
Title: CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length, Intrinsic Difficulty, and Distractor Density
Abstract:
Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($ρ$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.
中文摘要:CogniLoad是基于认知负荷理论的新型基准测试,通过独立调控内在难度、干扰信息和任务长度三个核心维度,系统评估了22个先进大语言模型的推理能力,揭示了它们在任务长度敏感性、复杂度容忍度和干扰响应方面的差异化表现。
English Summary: CogniLoad is a synthetic benchmark based on Cognitive Load Theory that enables precise evaluation of LLM reasoning by independently controlling intrinsic difficulty, distractor interference, and task length, revealing distinct performance patterns across 22 state-of-the-art models.

Authors:Nikolai Skripko
Title: Instruction-Following Evaluation in Function Calling for Large Language Models
Abstract:
Function calling is a core capability of large language models, essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), tau^2-Bench (arXiv:2506.07982), and ACEBench (arXiv:2501.12851) evaluate argument correctness but do not test adherence to format instructions embedded in parameter descriptions, such as enclosing values in double quotes or using ISO date formats. We introduce IFEval-FC, a benchmark inspired by IFEval (arXiv:2311.07911) that assesses precise instruction following in function calling. IFEval-FC encodes verifiable formats directly within JSON schema descriptions, for example specifying that a value must not contain punctuation. It includes 750 test cases, each consisting of a function with an embedded format for one of its input parameters and a corresponding user query. Evaluation is fully algorithmic, ensuring objectivity, reproducibility, and scalability. Our results show that even state-of-the-art proprietary models, including GPT-5 and Claude 4.1 Opus, frequently fail to follow basic formatting rules, highlighting a practical limitation for real-world agent systems. The complete codebase and data are publicly available at https://github.com/Skripkon/IFEval-FC.
中文摘要:作者提出了IFEval-FC基准测试,专门评估大语言模型在函数调用中遵循精确格式指令的能力,结果表明即使GPT-5等先进模型也经常无法遵守基本格式规则,这暴露了现有基准仅测试参数正确性的不足。
English Summary: The authors introduce IFEval-FC, a benchmark that evaluates large language models' ability to follow precise formatting instructions in function calling, revealing that even advanced models like GPT-5 struggle with basic format rules despite existing benchmarks focusing only on argument correctness.

Authors:Binhua Huang, Ni Wang, Wendong Yao, Soumyabrata Dev
Title: MVP: Motion Vector Propagation for Zero-Shot Video Object Detection
Abstract:
Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at https://github.com/microa/MVP.
中文: 本文提出MVP方法,通过仅在关键帧使用开放词汇检测器并利用压缩域运动矢量传播检测结果,无需训练即可降低计算成本,在保持零样本视频覆盖能力的同时获得接近逐帧检测的精度。
English: This paper introduces MVP, a training-free method that reduces computational costs by applying an open-vocabulary detector only to keyframes and propagating detections via compressed-domain motion vectors, achieving competitive accuracy while maintaining zero-shot video coverage.

Authors:Mehrdad Moradi, Shengzhe Chen, Hao Yan, Kamran Paynabar
Title: A Single Image Is All You Need: Zero-Shot Anomaly Localization Without Training Data
Abstract:
Anomaly detection in images is typically addressed by learning from collections of training data or relying on reference samples. In many real-world scenarios, however, such training data may be unavailable, and only the test image itself is provided. We address this zero-shot setting by proposing a single-image anomaly localization method that leverages the inductive bias of convolutional neural networks, inspired by Deep Image Prior (DIP). Our method is named Single Shot Decomposition Network (SSDnet). Our key assumption is that natural images often exhibit unified textures and patterns, and that anomalies manifest as localized deviations from these repetitive or stochastic patterns. To learn the deep image prior, we design a patch-based training framework where the input image is fed directly into the network for self-reconstruction, rather than mapping random noise to the image as done in DIP. To avoid the model simply learning an identity mapping, we apply masking, patch shuffling, and small Gaussian noise. In addition, we use a perceptual loss based on inner-product similarity to capture structure beyond pixel fidelity. Our approach needs no external training data, labels, or references, and remains robust in the presence of noise or missing pixels. SSDnet achieves 0.99 AUROC and 0.60 AUPRC on MVTec-AD and 0.98 AUROC and 0.67 AUPRC on the fabric dataset, outperforming state-of-the-art methods. The implementation code will be released at https://github.com/mehrdadmoradi124/SSDnet
中文: SSDnet是一种无需训练数据的零样本异常定位方法,通过基于图像块的自重构网络结合掩码和感知损失来检测异常,在多个基准数据集上取得了领先的性能。
English: SSDnet is a zero-shot anomaly localization method that uses a patch-based self-reconstruction network with masking and perceptual loss to detect anomalies without any training data, achieving state-of-the-art performance on benchmark datasets.

Authors:Yixin Zhang, Ryan Chamberlain, Lawrence Ngo, Kevin Kramer, Maciej A. Mazurowski
Title: Rethinking Pulmonary Embolism Segmentation: A Study of Current Approaches and Challenges with an Open Weight Model
Abstract:
In this study, we curated a densely annotated in-house dataset comprising 490 CTPA scans. Using this dataset, we systematically evaluated nine widely used segmentation architectures from both the CNN and Vision Transformer (ViT) families, initialized with either pretrained or random weights, under a unified testing framework as a performance audit. Our study leads to several important observations: (1) 3D U-Net with a ResNet encoder remains a highly effective architecture for PE segmentation; (2) 3D models are particularly well-suited to this task given the morphological characteristics of emboli; (3) CNN-based models generally yield superior performance compared to their ViT-based counterparts in PE segmentation; (4) classification-based pretraining, even on large PE datasets, can adversely impact segmentation performance compared to training from scratch, suggesting that PE classification and segmentation may rely on different sets of discriminative features; (5) different model architectures show a highly consistent pattern of segmentation performance when trained on the same data; and (6) while central and large emboli can be segmented with satisfactory accuracy, distal emboli remain challenging due to both task complexity and the scarcity of high-quality datasets. Besides these findings, our best-performing model achieves a mean Dice score of 0.7131 for segmentation. It detects 181 emboli with 49 false positives and 28 false negatives from 60 in-house testing scans. Its generalizability is further validated on public datasets.
中文: 本研究利用490例密集标注的CTPA扫描数据集评估了九种分割架构,发现带ResNet编码器的3D U-Net在肺栓塞分割中表现最佳,基于CNN的模型普遍优于视觉Transformer,且预训练可能损害分割性能。
English: This study evaluated nine segmentation architectures using a densely annotated dataset of 490 CTPA scans, finding that 3D U-Net with a ResNet encoder performs best for pulmonary embolism segmentation, with CNN-based models generally outperforming Vision Transformers and pretraining potentially harming performance.

Authors:Yi Gu, Kuniaki Saito, Jiaxin Ma
Title: Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction
Abstract:
As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modalities dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at https://github.com/omron-sinicx/medical-modality-dropout.
中文摘要:本研究提出一种多模态学习框架,通过改进的模态丢弃和对比学习技术增强对缺失数据和模态不平衡的鲁棒性,在临床应用中实现了卓越性能。
English Summary: This study introduces a multimodal learning framework that enhances robustness to missing data and modality imbalance through improved dropout techniques and contrastive learning, achieving superior performance in clinical applications.

Authors:Jialong Mai, Jinxin Ji, Xiaofen Xing, Chen Yang, Weidong Chen, Jingyuan Xing, Xiangmin Xu
Title: MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech
Abstract:
Mainstream Automatic Speech Recognition (ASR) systems excel at transcribing lexical content, but largely fail to recognize nonverbal vocalizations (NVs) embedded in speech, such as sighs, laughs, and coughs. This capability is important for a comprehensive understanding of human communication, as NVs convey crucial emotional and intentional cues. Progress in NV-aware ASR has been hindered by the lack of high-quality, well-annotated datasets. To address this gap, we introduce MNV-17, a 7.55-hour performative Mandarin speech dataset. Unlike most existing corpora that rely on model-based detection, MNV-17's performative nature ensures high-fidelity, clearly articulated NV instances. To the best of our knowledge, MNV-17 provides the most extensive set of nonverbal vocalization categories, comprising 17 distinct and well-balanced classes of common NVs. We benchmarked MNV-17 on four mainstream ASR architectures, evaluating their joint performance on semantic transcription and NV classification. The dataset and the pretrained model checkpoints will be made publicly available to facilitate future research in expressive ASR.
中文: 主流语音识别系统难以识别叹息、笑声等非语言声音,为此我们推出了MNV-17数据集,该高质量标注的普通话语音库包含17类非语言声音,将促进情感语音识别研究的发展。
English: Mainstream ASR systems struggle to recognize nonverbal vocalizations like sighs and laughs, so the MNV-17 dataset is introduced to address this gap by providing high-quality, annotated Mandarin speech with 17 distinct NV categories for improved expressive ASR research.

Authors:Felix Petre, Lasse Bienzeisler, Bernhard Friedrich
Title: Introducing a novel Location-Assignment Algorithm for Activity-Based Transport Models: CARLA
Abstract:
This paper introduces CARLA (spatially Constrained Anchor-based Recursive Location Assignment), a recursive algorithm for assigning secondary or any activity locations in activity-based travel models. CARLA minimizes distance deviations while integrating location potentials, ensuring more realistic activity distributions. The algorithm decomposes trip chains into smaller subsegments, using geometric constraints and configurable heuristics to efficiently search the solution space. Compared to a state-of-the-art relaxation-discretization approach, CARLA achieves significantly lower mean deviations, even under limited runtimes. It is robust to real-world data inconsistencies, such as infeasible distances, and can flexibly adapt to various priorities, such as emphasizing location attractiveness or distance accuracy. CARLA's versatility and efficiency make it a valuable tool for improving the spatial accuracy of activity-based travel models and agent-based transport simulations. Our implementation is available at https://github.com/tnoud/carla.
Chinese: CARLA是一种递归算法,通过最小化距离偏差并灵活处理现实数据不一致性,有效提升基于活动的出行模型中活动地点分配的准确性和实用性。
English: CARLA is a recursive algorithm that enhances activity-based travel models by efficiently assigning activity locations with minimal distance deviations and robust handling of real-world data inconsistencies.

Authors:Ling Yue, Nithin Somasekharan, Tingwen Zhang, Yadi Cao, Shaowu Pan
Title: Foam-Agent: An End-to-End Composable Multi-Agent Framework for Automating CFD Simulation in OpenFOAM
Abstract:
Computational Fluid Dynamics (CFD) is an essential simulation tool in engineering, yet its steep learning curve and complex manual setup create significant barriers. To address these challenges, we introduce Foam-Agent, a multi-agent framework that automates the entire end-to-end OpenFOAM workflow from a single natural language prompt. Our key innovations address critical gaps in existing systems: 1. An Comprehensive End-to-End Simulation Automation: Foam-Agent is the first system to manage the full simulation pipeline, including advanced pre-processing with a versatile Meshing Agent capable of handling external mesh files and generating new geometries via Gmsh, automatic generation of HPC submission scripts, and post-simulation visualization via ParaView. 2. Composable Service Architecture: Going beyond a monolithic agent, the framework uses Model Context Protocol (MCP) to expose its core functions as discrete, callable tools. This allows for flexible integration and use by other agentic systems, such as Claude-code, for more exploratory workflows. 3. High-Fidelity Configuration Generation: We achieve superior accuracy through a Hierarchical Multi-Index RAG for precise context retrieval and a dependency-aware generation process that ensures configuration consistency. Evaluated on a benchmark of 110 simulation tasks, Foam-Agent achieves an 88.2% success rate with Claude 3.5 Sonnet, significantly outperforming existing frameworks (55.5% for MetaOpenFOAM). Foam-Agent dramatically lowers the expertise barrier for CFD, demonstrating how specialized multi-agent systems can democratize complex scientific computing. The code is public at https://github.com/csml-rpi/Foam-Agent.
中文: Foam-Agent是一个多智能体框架,通过单一自然语言提示即可自动化整个OpenFOAM工作流程,在基准测试中达到88.2%的成功率,显著降低了计算流体动力学的专业门槛。
English: Foam-Agent is a multi-agent framework that automates the entire OpenFOAM workflow from a single natural language prompt, achieving an 88.2% success rate on benchmark tests and significantly lowering the expertise barrier for Computational Fluid Dynamics.

Authors:Ling Yue, Nithin Somasekharan, Tingwen Zhang, Yadi Cao, Shaowu Pan
Title: Foam-Agent 2.0: An End-to-End Composable Multi-Agent Framework for Automating CFD Simulation in OpenFOAM
Abstract:
Computational Fluid Dynamics (CFD) is an essential simulation tool in engineering, yet its steep learning curve and complex manual setup create significant barriers. To address these challenges, we introduce Foam-Agent, a multi-agent framework that automates the entire end-to-end OpenFOAM workflow from a single natural language prompt. Our key innovations address critical gaps in existing systems: 1. An Comprehensive End-to-End Simulation Automation: Foam-Agent is the first system to manage the full simulation pipeline, including advanced pre-processing with a versatile Meshing Agent capable of handling external mesh files and generating new geometries via Gmsh, automatic generation of HPC submission scripts, and post-simulation visualization via ParaView. 2. Composable Service Architecture: Going beyond a monolithic agent, the framework uses Model Context Protocol (MCP) to expose its core functions as discrete, callable tools. This allows for flexible integration and use by other agentic systems, such as Claude-code, for more exploratory workflows. 3. High-Fidelity Configuration Generation: We achieve superior accuracy through a Hierarchical Multi-Index RAG for precise context retrieval and a dependency-aware generation process that ensures configuration consistency. Evaluated on a benchmark of 110 simulation tasks, Foam-Agent achieves an 88.2% success rate with Claude 3.5 Sonnet, significantly outperforming existing frameworks (55.5% for MetaOpenFOAM). Foam-Agent dramatically lowers the expertise barrier for CFD, demonstrating how specialized multi-agent systems can democratize complex scientific computing. The code is public at https://github.com/csml-rpi/Foam-Agent.
中文: Foam-Agent是一个多智能体框架,通过单一自然语言提示即可自动化整个OpenFOAM工作流程,在基准测试中达到88.2%的成功率,显著降低了计算流体动力学的专业门槛。
English: Foam-Agent is a multi-agent framework that automates the entire OpenFOAM workflow from a single natural language prompt, achieving an 88.2% success rate on benchmark tests and significantly lowering the expertise barrier for Computational Fluid Dynamics.

Authors:Hongyi Luo, Qing Cheng, Daniel Matos, Hari Krishna Gadi, Yanfeng Zhang, Lu Liu, Yongliang Wang, Niclas Zeller, Daniel Cremers, Liqiu Meng
Title: TurnBack: A Geospatial Route Cognition Benchmark for Large Language Models through Reverse Route
Abstract:
Humans can interpret geospatial information through natural language, while the geospatial cognition capabilities of Large Language Models (LLMs) remain underexplored. Prior research in this domain has been constrained by non-quantifiable metrics, limited evaluation datasets and unclear research hierarchies. Therefore, we propose a large-scale benchmark and conduct a comprehensive evaluation of the geospatial route cognition of LLMs. We create a large-scale evaluation dataset comprised of 36000 routes from 12 metropolises worldwide. Then, we introduce PathBuilder, a novel tool for converting natural language instructions into navigation routes, and vice versa, bridging the gap between geospatial information and natural language. Finally, we propose a new evaluation framework and metrics to rigorously assess 11 state-of-the-art (SOTA) LLMs on the task of route reversal. The benchmark reveals that LLMs exhibit limitation to reverse routes: most reverse routes neither return to the starting point nor are similar to the optimal route. Additionally, LLMs face challenges such as low robustness in route generation and high confidence for their incorrect answers. Code\ \&\ Data available here: \href{https://github.com/bghjmn32/EMNLP2025_Turnback}{TurnBack.}
中文摘要:本研究提出了一个大规模基准来评估大语言模型的地理空间路线认知能力,发现其在准确反转路线方面存在局限,并揭示了路线生成鲁棒性低及对错误答案过度自信等问题。
English Summary: This study introduces a large-scale benchmark to evaluate the geospatial route cognition of Large Language Models, revealing their limitations in accurately reversing routes and highlighting issues like low robustness and misplaced confidence in incorrect responses.

Authors:Xiuding Cai, Yaoyao Zhu, Linjie Fu, Dong Miao, Yu Yao
Title: Self Identity Mapping
Abstract:
Regularization is essential in deep learning to enhance generalization and mitigate overfitting. However, conventional techniques often rely on heuristics, making them less reliable or effective across diverse settings. We propose Self Identity Mapping (SIM), a simple yet effective, data-intrinsic regularization framework that leverages an inverse mapping mechanism to enhance representation learning. By reconstructing the input from its transformed output, SIM reduces information loss during forward propagation and facilitates smoother gradient flow. To address computational inefficiencies, We instantiate SIM as $ ρ\text{SIM} $ by incorporating patch-level feature sampling and projection-based method to reconstruct latent features, effectively lowering complexity. As a model-agnostic, task-agnostic regularizer, SIM can be seamlessly integrated as a plug-and-play module, making it applicable to different network architectures and tasks. We extensively evaluate $ρ\text{SIM}$ across three tasks: image classification, few-shot prompt learning, and domain generalization. Experimental results show consistent improvements over baseline methods, highlighting $ρ\text{SIM}$'s ability to enhance representation learning across various tasks. We also demonstrate that $ρ\text{SIM}$ is orthogonal to existing regularization methods, boosting their effectiveness. Moreover, our results confirm that $ρ\text{SIM}$ effectively preserves semantic information and enhances performance in dense-to-dense tasks, such as semantic segmentation and image translation, as well as in non-visual domains including audio classification and time series anomaly detection. The code is publicly available at https://github.com/XiudingCai/SIM-pytorch.
中文: 本文提出自身份映射(SIM)这一数据内在正则化框架,通过逆向映射机制从变换输出重构输入,以增强表征学习并改善梯度流,其计算高效且能作为即插即用模块广泛应用于多种网络架构与任务。
English: The authors propose Self Identity Mapping (SIM), a data-intrinsic regularization framework that uses inverse mapping to reconstruct inputs from transformed outputs, improving representation learning and gradient flow while being computationally efficient and applicable across various tasks and architectures.

Authors:Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun
Title: MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
Abstract:
Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7\% GPU memory cost and 8.7\% inference time of Qwen2.5-VL 7B.
Chinese: MiniCPM-V 4.5是一款高效的80亿参数多模态模型,在性能上超越主流专有模型和更大规模开源模型的同时,显著降低了GPU内存消耗和推理时间。
English: MiniCPM-V 4.5 is an efficient 8B parameter multimodal model that surpasses leading proprietary and larger open-source models in performance while significantly reducing GPU memory and inference time.

Authors:Kairong Han, Weidong Huang, Taiyang Zhou, Peng Zhen, Kun Kuang
Title: Augmenting Limited and Biased RCTs through Pseudo-Sample Matching-Based Observational Data Fusion Method
Abstract:
In the online ride-hailing pricing context, companies often conduct randomized controlled trials (RCTs) and utilize uplift models to assess the effect of discounts on customer orders, which substantially influences competitive market outcomes. However, due to the high cost of RCTs, the proportion of trial data relative to observational data is small, which only accounts for 0.65\% of total traffic in our context, resulting in significant bias when generalizing to the broader user base. Additionally, the complexity of industrial processes reduces the quality of RCT data, which is often subject to heterogeneity from potential interference and selection bias, making it difficult to correct. Moreover, existing data fusion methods are challenging to implement effectively in complex industrial settings due to the high dimensionality of features and the strict assumptions that are hard to verify with real-world data. To address these issues, we propose an empirical data fusion method called pseudo-sample matching. By generating pseudo-samples from biased, low-quality RCT data and matching them with the most similar samples from large-scale observational data, the method expands the RCT dataset while mitigating its heterogeneity. We validated the method through simulation experiments, conducted offline and online tests using real-world data. In a week-long online experiment, we achieved a 0.41\% improvement in profit, which is a considerable gain when scaled to industrial scenarios with hundreds of millions in revenue. In addition, we discuss the harm to model training, offline evaluation, and online economic benefits when the RCT data quality is not high, and emphasize the importance of improving RCT data quality in industrial scenarios. Further details of the simulation experiments can be found in the GitHub repository https://github.com/Kairong-Han/Pseudo-Matching.
中文: 本研究提出了一种伪样本匹配方法,通过将有限且有偏的随机对照试验数据与大量观测数据融合,改善了数据质量,在线测试中实现了0.41%的利润提升,有效解决了工业场景中数据融合的难题。
English: This study introduces a pseudo-sample matching method that enhances the quality of limited and biased randomized controlled trial (RCT) data by integrating it with extensive observational data, leading to a 0.41% profit increase in online tests and addressing challenges in industrial data fusion.

Authors:Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, Yuxiao Dong
Title: MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents
Abstract:
Building general-purpose graphical user interface (GUI) agents has become increasingly promising with the progress in vision language models. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging due to the heavy-tailed distribution of task difficulty and the inefficiency of large-scale environment sampling. We present an online agentic reinforcement learning framework MOBILERL to enhance GUI agents in mobile environments. Its core component is the Difficulty-Adaptive GRPO (ADAGRPO) algorithm. In ADAGRPO, we design difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to different task difficulties. We introduce the shortest path reward adjustment strategy to reshape rewards concerning the task length in multi-turn agentic tasks. Those strategies jointly stabilize RL training, improve sample efficiency, and generate strong performance across diverse mobile apps and tasks. We apply MOBILERL to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resultant MOBILERL-9B model achieves state-of-the-art results in terms of success rates on both AndroidWorld (75.8%) and AndroidLab (46.8%). The MOBILERL framework is adopted in the AutoGLM products, and also open-sourced at https://github.com/THUDM/MobileRL.
中文摘要:MOBILERL框架通过自适应强化学习策略提升移动GUI代理性能,在Android平台上取得领先成果,并已应用于AutoGLM产品中开源发布。
English Summary: The MOBILERL framework enhances mobile GUI agents through adaptive reinforcement learning strategies, achieving state-of-the-art performance on Android platforms and being implemented in AutoGLM products.

Authors:Julian Kaltheuner, Alexander Oebel, Hannah Droege, Patrick Stotko, Reinhard Klein
Title: Preconditioned Deformation Grids
Abstract:
Dynamic surface reconstruction of objects from point cloud sequences is a challenging field in computer graphics. Existing approaches either require multiple regularization terms or extensive training data which, however, lead to compromises in reconstruction accuracy as well as over-smoothing or poor generalization to unseen objects and motions. To address these lim- itations, we introduce Preconditioned Deformation Grids, a novel technique for estimating coherent deformation fields directly from unstructured point cloud sequences without requiring or forming explicit correspondences. Key to our approach is the use of multi-resolution voxel grids that capture the overall motion at varying spatial scales, enabling a more flexible deformation representation. In conjunction with incorporating grid-based Sobolev preconditioning into gradient-based optimization, we show that applying a Chamfer loss between the input point clouds as well as to an evolving template mesh is sufficient to obtain accurate deformations. To ensure temporal consistency along the object surface, we include a weak isometry loss on mesh edges which complements the main objective without constraining deformation fidelity. Extensive evaluations demonstrate that our method achieves superior results, particularly for long sequences, compared to state-of-the-art techniques.
中文摘要:本文提出预条件变形网格技术,通过多分辨率体素网格捕捉不同空间尺度的整体运动,结合基于网格的Sobolev预条件优化,仅需使用Chamfer损失和弱等距损失即可从无序点云序列中实现高精度的动态表面重建,在长序列处理上优于现有方法。
English Summary: This paper introduces Preconditioned Deformation Grids, a novel technique that reconstructs dynamic surfaces from point cloud sequences using multi-resolution voxel grids and grid-based Sobolev preconditioning, achieving superior accuracy and temporal consistency without requiring explicit correspondences or extensive training data.

Authors:Jiahe Li, Jiawei Zhang, Youmin Zhang, Xiao Bai, Jin Zheng, Xiaohan Yu, Lin Gu
Title: GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction
Abstract:
Reconstructing accurate surfaces with radiance fields has achieved remarkable progress in recent years. However, prevailing approaches, primarily based on Gaussian Splatting, are increasingly constrained by representational bottlenecks. In this paper, we introduce GeoSVR, an explicit voxel-based framework that explores and extends the under-investigated potential of sparse voxels for achieving accurate, detailed, and complete surface reconstruction. As strengths, sparse voxels support preserving the coverage completeness and geometric clarity, while corresponding challenges also arise from absent scene constraints and locality in surface refinement. To ensure correct scene convergence, we first propose a Voxel-Uncertainty Depth Constraint that maximizes the effect of monocular depth cues while presenting a voxel-oriented uncertainty to avoid quality degradation, enabling effective and robust scene constraints yet preserving highly accurate geometries. Subsequently, Sparse Voxel Surface Regularization is designed to enhance geometric consistency for tiny voxels and facilitate the voxel-based formation of sharp and accurate surfaces. Extensive experiments demonstrate our superior performance compared to existing methods across diverse challenging scenarios, excelling in geometric accuracy, detail preservation, and reconstruction completeness while maintaining high efficiency. Code is available at https://github.com/Fictionarry/GeoSVR.
中文摘要:GeoSVR提出了一种新颖的显式体素框架,通过引入体素不确定性深度约束和稀疏体素表面正则化,利用稀疏体素克服辐射场表面重建中的表示瓶颈,在保持高效率的同时实现了卓越的几何精度、细节保留和完整重建效果。
English Summary: GeoSVR introduces a novel explicit voxel-based framework that overcomes representational bottlenecks in radiance field surface reconstruction by leveraging sparse voxels with innovative constraints and regularization, achieving superior geometric accuracy, detail preservation, and completeness while maintaining high efficiency.

Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
Title: TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Abstract:
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
中文: TempSamp-R1是一种强化微调框架,通过引入离策略监督和非线性软优势计算,提升了多模态大语言模型在视频时序定位任务中的性能,在多个基准数据集上取得了最优效果。
English: TempSamp-R1 is a reinforcement fine-tuning framework that enhances multimodal large language models for video temporal grounding by incorporating off-policy supervision and a non-linear soft advantage method, achieving state-of-the-art results on benchmark datasets.

Authors:Richard Cornelius Suwandi, Feng Yin, Juntao Wang, Renjie Li, Tsung-Hui Chang, Sergios Theodoridis
Title: Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLMs
Abstract:
The efficiency of Bayesian optimization (BO) relies heavily on the choice of the Gaussian process (GP) kernel, which plays a central role in balancing exploration and exploitation under limited evaluation budgets. Traditional BO methods often rely on fixed or heuristic kernel selection strategies, which can result in slow convergence or suboptimal solutions when the chosen kernel is poorly suited to the underlying objective function. To address this limitation, we propose a freshly-baked Context-Aware Kernel Evolution (CAKE) to enhance BO with large language models (LLMs). Concretely, CAKE leverages LLMs as the crossover and mutation operators to adaptively generate and refine GP kernels based on the observed data throughout the optimization process. To maximize the power of CAKE, we further propose BIC-Acquisition Kernel Ranking (BAKER) to select the most effective kernel through balancing the model fit measured by the Bayesian information criterion (BIC) with the expected improvement at each iteration of BO. Extensive experiments demonstrate that our fresh CAKE-based BO method consistently outperforms established baselines across a range of real-world tasks, including hyperparameter optimization, controller tuning, and photonic chip design. Our code is publicly available at https://github.com/richardcsuwandi/cake.
中文摘要:本文提出的情境感知核演化(CAKE)方法通过大语言模型动态生成和优化高斯过程核,显著提升了贝叶斯优化的性能,大量实验证明该方法在多种实际应用中均优于传统基线方法。
English Summary: The proposed Context-Aware Kernel Evolution (CAKE) method enhances Bayesian optimization by using large language models to dynamically generate and refine Gaussian process kernels, with comprehensive experiments showing its consistent superiority over traditional approaches across various applications.

Authors:Kai Li, Xingxing Weng, Yupeng Deng, Yu Meng, Chao Pang, Gui-Song Xia, Xiangyu Zhao
Title: DragOSM: Extract Building Roofs and Footprints from Aerial Images by Aligning Historical Labels
Abstract:
Extracting polygonal roofs and footprints from remote sensing images is critical for large-scale urban analysis. Most existing methods rely on segmentation-based models that assume clear semantic boundaries of roofs, but these approaches struggle in off- nadir images, where the roof and footprint are significantly displaced, and facade pixels are fused with the roof boundary. With the increasing availability of open vector map annotations, e.g., OpenStreetMap, utilizing historical labels for off-nadir image annotation has become viable because remote sensing images are georeferenced once captured. However, these historical labels commonly suffer from significant positional discrepancies with new images and only have one annotation (roof or footprint), which fails to describe the correct structures of a building. To address these discrepancies, we first introduce a concept of an alignment token, which encodes the correction vector to guide the label correction. Based on this concept, we then propose Drag OpenStreetMap Labels (DragOSM), a novel model designed to align dislocated historical labels with roofs and footprints. Specifically, DragOSM formulates the label alignment as an interactive denoising process, modeling the positional discrepancy as a Gaussian distribution. During training, it learns to correct these errors by simulating misalignment with random Gaussian perturbations; during inference, it iteratively refines the positions of input labels. To validate our method, we further present a new dataset, Repairing Buildings in OSM (ReBO), comprising 179,265 buildings with both OpenStreetMap and manually corrected annotations across 5,473 images from 41 cities. Experimental results on ReBO demonstrate the effectiveness of DragOSM. Code, dataset, and trained models are publicly available at https://github.com/likaiucas/DragOSM.git.
中文摘要:DragOSM模型通过将标签对齐构建为交互式去噪过程,有效校正历史地图标注的位置偏差,并在新构建的ReBO数据集上验证了其优越性能。
English Summary: The proposed DragOSM model effectively corrects positional discrepancies in historical map labels by treating alignment as an interactive denoising process, using a new ReBO dataset for validation.

Authors:Romain Thoreau, Jessie Levillain, Dawa Derksen
Title: Can multimodal representation learning by alignment preserve modality-specific information?
Abstract:
Combining multimodal data is a key issue in a wide range of machine learning tasks, including many remote sensing problems. In Earth observation, early multimodal data fusion methods were based on specific neural network architectures and supervised learning. Ever since, the scarcity of labeled data has motivated self-supervised learning techniques. State-of-the-art multimodal representation learning techniques leverage the spatial alignment between satellite data from different modalities acquired over the same geographic area in order to foster a semantic alignment in the latent space. In this paper, we investigate how this methods can preserve task-relevant information that is not shared across modalities. First, we show, under simplifying assumptions, when alignment strategies fundamentally lead to an information loss. Then, we support our theoretical insight through numerical experiments in more realistic settings. With those theoretical and empirical evidences, we hope to support new developments in contrastive learning for the combination of multimodal satellite data. Our code and data is publicly available at https://github.com/Romain3Ch216/alg_maclean_25.
中文摘要:本文探讨了地球观测中多模态对比学习如何保留跨模态未共享的任务相关信息,从理论和实验两方面揭示了对齐策略导致信息损失的问题。
English Summary: This paper examines how multimodal contrastive learning in Earth observation can preserve task-specific information not shared across modalities, revealing both theoretical and empirical evidence of information loss from alignment strategies.

Authors:Yuanhan Wang, Yifei Chen, Shuo Jiang, Wenjing Yu, Mingxuan Liu, Beining Wu, Jinying Zong, Feiwei Qin, Changmiao Wang, Qiyuan Tian
Title: SmaRT: Style-Modulated Robust Test-Time Adaptation for Cross-Domain Brain Tumor Segmentation in MRI
Abstract:
Reliable brain tumor segmentation in MRI is indispensable for treatment planning and outcome monitoring, yet models trained on curated benchmarks often fail under domain shifts arising from scanner and protocol variability as well as population heterogeneity. Such gaps are especially severe in low-resource and pediatric cohorts, where conventional test-time or source-free adaptation strategies often suffer from instability and structural inconsistency. We propose SmaRT, a style-modulated robust test-time adaptation framework that enables source-free cross-domain generalization. SmaRT integrates style-aware augmentation to mitigate appearance discrepancies, a dual-branch momentum strategy for stable pseudo-label refinement, and structural priors enforcing consistency, integrity, and connectivity. This synergy ensures both adaptation stability and anatomical fidelity under extreme domain shifts. Extensive evaluations on sub-Saharan Africa and pediatric glioma datasets show that SmaRT consistently outperforms state-of-the-art methods, with notable gains in Dice accuracy and boundary precision. Overall, SmaRT bridges the gap between algorithmic advances and equitable clinical applicability, supporting robust deployment of MRI-based neuro-oncology tools in diverse clinical environments. Our source code is available at https://github.com/baiyou1234/SmaRT.
中文摘要:SmaRT是一种风格调制测试时自适应框架,通过整合风格感知增强、动量伪标签优化和结构一致性约束,有效提升脑肿瘤分割在跨域场景下的准确性和稳定性,在资源匮乏及儿科数据集中表现优异。
English Summary: SmaRT is a style-modulated test-time adaptation framework that enhances brain tumor segmentation accuracy under domain shifts by integrating style-aware augmentation, momentum-based pseudo-label refinement, and structural consistency priors, demonstrating superior performance in challenging clinical datasets.

Authors:Jamiyan Sukhbaatar, Satoshi Imamura, Ibuki Inoue, Shoya Murakami, Kazi Mahmudul Hassan, Seungwoo Han, Ingon Chanpornpakdi, Toshihisa Tanaka
Title: SingLEM: Single-Channel Large EEG Model
Abstract:
Current deep learning models for electroencephalography (EEG) are often task-specific and depend on large labeled datasets, limiting their adaptability. Although emerging foundation models aim for broader applicability, their rigid dependence on fixed, high-density multi-channel montages restricts their use across heterogeneous datasets and in missing-channel or practical low-channel settings. To address these limitations, we introduce SingLEM, a self-supervised foundation model that learns robust, general-purpose representations from single-channel EEG, making it inherently hardware agnostic. The model employs a hybrid encoder architecture that combines convolutional layers to extract local features with a hierarchical transformer to model both short- and long-range temporal dependencies. SingLEM is pretrained on 71 public datasets comprising over 9,200 subjects and 357,000 single-channel hours of EEG. When evaluated as a fixed feature extractor across six motor imagery and cognitive tasks, aggregated single-channel representations consistently outperformed leading multi-channel foundation models and handcrafted baselines. These results demonstrate that a single-channel approach can achieve state-of-the-art generalization while enabling fine-grained neurophysiological analysis and enhancing interpretability. The source code and pretrained models are available at https://github.com/ttlabtuat/SingLEM.
中文: SingLEM是一种自监督基础模型,通过单通道脑电图学习鲁棒的通用表征,在多种任务中实现最优性能,同时具备硬件无关性和更强的可解释性。
English: SingLEM is a self-supervised foundation model that learns robust, general-purpose representations from single-channel EEG, enabling state-of-the-art performance across diverse tasks while being hardware agnostic and enhancing interpretability.

Authors:Geewook Kim, Minjoon Seo
Title: Does Audio Matter for Modern Video-LLMs and Their Benchmarks?
Abstract:
Modern multimodal large language models often claim "video understanding," yet most evaluations use muted videos or simply discard audio. We ask a direct question: how much does audio actually matter for contemporary Video-LLMs and the benchmarks that certify them? We audit widely used suites and observe that many items are even solvable from a single frame, rendering audio largely redundant. Building on LLaVA-OneVision architecture, we attach a speech/audio encoder (e.g., Whisper) and analyze when audio helps, while addressing audio token explosion with a lightweight Mamba-based state-space token compressor. We find that audio yields minimal gains on recent video benchmarks but is decisive on curated, audio-sensitive subsets. To enable faithful evaluation, we release AVQA-Hard and Music-AVQA-Hard, our model, and code. Our findings surface a growing gap between current academic practice and real-world expectations, and provide practical tools for scalable audio-visual Video-LLMs. We will fully open-source our work at https://github.com/naver-ai/LLaVA-AV-SSM.
中文摘要:当前视频大语言模型在标准测试中音频作用甚微,但在音频敏感任务中至关重要,为此我们推出了新的评估工具和可扩展模型,以弥合学术实践与现实需求之间的差距。
English Summary: Current Video-LLMs show minimal audio benefit on standard benchmarks but prove crucial for audio-sensitive tasks, prompting new evaluation tools and a scalable model to bridge the gap between academic practice and real-world needs.

Authors:Qiushi Han, Yuan Liao, Youhao Si, Liya Huang
Title: Brainprint-Modulated Target Speaker Extraction
Abstract:
Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-subject variability that limits the efficacy of generalized models. To address these issues, we propose Brainprint-Modulated Target Speaker Extraction (BM-TSE), a novel framework for personalized and high-fidelity extraction. BM-TSE first employs a spatio-temporal EEG encoder with an Adaptive Spectral Gain (ASG) module to extract stable features resilient to non-stationarity. The core of our framework is a personalized modulation mechanism, where a unified brainmap embedding is learned under the joint supervision of subject identification (SID) and auditory attention decoding (AAD) tasks. This learned brainmap, encoding both static user traits and dynamic attentional states, actively refines the audio separation process, dynamically tailoring the output to each user. Evaluations on the public KUL and Cocktail Party datasets demonstrate that BM-TSE achieves state-of-the-art performance, significantly outperforming existing methods. Our code is publicly accessible at: https://github.com/rosshan-orz/BM-TSE.
中文:BM-TSE框架通过利用稳定的脑电特征和统一的脑图嵌入,实现了针对目标说话者的个性化提取方法,在公开数据集上取得了领先的性能。
English: The BM-TSE framework introduces a personalized approach to target speaker extraction by leveraging stable EEG features and a unified brainmap embedding, achieving state-of-the-art results on public datasets.

Authors:Milan Straka
Title: CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution
Abstract:
We present CorPipe 25, the winning entry to the CRAC 2025 Shared Task on Multilingual Coreference Resolution. This fourth iteration of the shared task introduces a new LLM track alongside the original unconstrained track, features reduced development and test sets to lower computational requirements, and includes additional datasets. CorPipe 25 represents a complete reimplementation of our previous systems, migrating from TensorFlow to PyTorch. Our system significantly outperforms all other submissions in both the LLM and unconstrained tracks by a substantial margin of 8 percentage points. The source code and trained models are publicly available at https://github.com/ufal/crac2025-corpipe.
中文: CorPipe 25是CRAC 2025共享任务的获胜系统,通过完全基于PyTorch的重构实现,在LLM和无约束双赛道中以8个百分点的显著优势超越所有其他参赛系统。
English: CorPipe 25 is the winning system in the CRAC 2025 Shared Task, featuring a complete PyTorch reimplementation that outperforms all other submissions by 8 percentage points across both LLM and unconstrained tracks.

Authors:Xiangmin Shen, Wenyuan Cheng, Yan Chen, Zhenyuan Li, Yuqiao Gu, Lingzhi Wang, Wencheng Zhao, Dawei Sun, Jiashui Wang
Title: AEAS: Actionable Exploit Assessment System
Abstract:
Security practitioners face growing challenges in exploit assessment, as public vulnerability repositories are increasingly populated with inconsistent and low-quality exploit artifacts. Existing scoring systems, such as CVSS and EPSS, offer limited support for this task. They either rely on theoretical metrics or produce opaque probability estimates without assessing whether usable exploit code exists. In practice, security teams often resort to manual triage of exploit repositories, which is time-consuming, error-prone, and difficult to scale. We present AEAS, an automated system designed to assess and prioritize actionable exploits through static analysis. AEAS analyzes both exploit code and associated documentation to extract a structured set of features reflecting exploit availability, functionality, and setup complexity. It then computes an actionability score for each exploit and produces ranked exploit recommendations. We evaluate AEAS on a dataset of over 5,000 vulnerabilities derived from 600+ real-world applications frequently encountered by red teams. Manual validation and expert review on representative subsets show that AEAS achieves a 100% top-3 success rate in recommending functional exploits and shows strong alignment with expert-validated rankings. These results demonstrate the effectiveness of AEAS in supporting exploit-driven vulnerability prioritization.
中文: 安全从业者面临漏洞利用评估的挑战,现有评分系统支持有限,而AEAS通过静态分析自动评估和优先排序可利用漏洞,在推荐功能性漏洞方面取得了高成功率。
English: Security practitioners struggle with inconsistent exploit artifacts and limited support from existing scoring systems, prompting the development of AEAS, an automated tool that assesses and prioritizes actionable exploits through static analysis, achieving high success rates in recommending functional exploits.

Authors:Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bartłomiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer
Title: Accurate and Efficient Low-Rank Model Merging in Core Space
Abstract:
In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.
中文: Core Space框架能够在共享对齐基中高效合并LoRA适配的神经网络,在保持低秩效率的同时显著提升视觉和语言任务的准确性,且仅需少量计算资源。
English: The Core Space framework enables efficient merging of LoRA-adapted neural networks within a shared alignment basis, preserving low-rank efficiency while significantly boosting accuracy across vision and language tasks with minimal computational resources.

Authors:Aniello Panariello, Daniel Marczak, Simone Magistri, Angelo Porrello, Bartłomiej Twardowski, Andrew D. Bagdanov, Simone Calderara, Joost van de Weijer
Title: Accurate and Efficient Low-Rank Model Merging in Core Space
Abstract:
In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at https://github.com/apanariello4/core-space-merging.
中文: Core Space框架能够在共享对齐基中高效合并LoRA适配的神经网络,在保持低秩效率的同时显著提升视觉和语言任务的准确性,且仅需少量计算资源。
English: The Core Space framework enables efficient merging of LoRA-adapted neural networks within a shared alignment basis, preserving low-rank efficiency while significantly boosting accuracy across vision and language tasks with minimal computational resources.

Authors:Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang
Title: I2VWM: Robust Watermarking for Image to Video Generation
Abstract:
The rapid progress of image-guided video generation (I2V) has raised concerns about its potential misuse in misinformation and fraud, underscoring the urgent need for effective digital watermarking. While existing watermarking methods demonstrate robustness within a single modality, they fail to trace source images in I2V settings. To address this gap, we introduce the concept of Robust Diffusion Distance, which measures the temporal persistence of watermark signals in generated videos. Building on this, we propose I2VWM, a cross-modal watermarking framework designed to enhance watermark robustness across time. I2VWM leverages a video-simulation noise layer during training and employs an optical-flow-based alignment module during inference. Experiments on both open-source and commercial I2V models demonstrate that I2VWM significantly improves robustness while maintaining imperceptibility, establishing a new paradigm for cross-modal watermarking in the era of generative video. \href{https://github.com/MrCrims/I2VWM-Robust-Watermarking-for-Image-to-Video-Generation}{Code Released.}
中文:本文提出I2VWM跨模态水印框架,通过鲁棒扩散距离测量时间持续性,并采用视频模拟噪声和光流对齐技术,显著提升了图像到视频生成中水印的鲁棒性。
English: This paper introduces I2VWM, a cross-modal watermarking framework that enhances robustness in image-to-video generation by measuring temporal persistence through Robust Diffusion Distance and employing video-simulation noise with optical-flow alignment.

Authors:Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xuejing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, Bo Zheng, Rui Men, Fan Zhou, Bowen Yu, Jianxin Yang, Le Yu, Jingren Zhou, Junyang Lin
Title: Qwen3-Omni Technical Report
Abstract:
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
Chinese: Qwen3-Omni是首个在文本、图像、音频和视频领域均保持顶尖性能的多模态模型,尤其在音频任务上表现卓越,超越了Gemini-2.5-Pro等主流闭源模型。
English: Qwen3-Omni is a groundbreaking multimodal model that maintains top-tier performance across text, image, audio, and video, excelling particularly in audio tasks where it surpasses leading closed-source models.

Authors:Shenwei Kang, Xin Zhang, Wen Liu, Bin Li, Yujie Liu, Bo Gao
Title: DA-Mamba: Dialogue-aware selective state-space model for multimodal engagement estimation
Abstract:
Human engagement estimation in conversational scenarios is essential for applications such as adaptive tutoring, remote healthcare assessment, and socially aware human--computer interaction. Engagement is a dynamic, multimodal signal conveyed by facial expressions, speech, gestures, and behavioral cues over time. In this work we introduce DA-Mamba, a dialogue-aware multimodal architecture that replaces attention-heavy dialogue encoders with Mamba-based selective state-space processing to achieve linear time and memory complexity while retaining expressive cross-modal reasoning. We design a Mamba dialogue-aware selective state-space model composed of three core modules: a Dialogue-Aware Encoder, and two Mamba-based fusion mechanisms: Modality-Group Fusion and Partner-Group Fusion, these modules achieve expressive dialogue understanding. Extensive experiments on three standard benchmarks (NoXi, NoXi-Add, and MPIIGI) show that DA-Mamba surpasses prior state-of-the-art (SOTA) methods in concordance correlation coefficient (CCC), while reducing training time and peak memory; these gains enable processing much longer sequences and facilitate real-time deployment in resource-constrained, multi-party conversational settings. The source code will be available at: https://github.com/kksssssss-ssda/MMEA.
中文摘要:DA-Mamba是一种对话感知的多模态架构,采用基于Mamba的选择性状态空间处理技术,在降低计算资源消耗的同时,实现了对对话场景中人类参与度的高效精准评估。
English Summary: DA-Mamba is a dialogue-aware multimodal architecture that uses Mamba-based selective state-space processing to efficiently estimate human engagement in conversations, achieving superior performance with reduced computational resources.

Authors:Bo Li, Yunkuo Lei, Tingting Bao, Yaxian Wang, Lingling Zhang, Jun Liu
Title: Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion
Abstract:
Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box mechanisms are difficult to generate high-quality decision maps. To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are third-generation neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps. Specifically, we first conduct an in-depth analysis of the model's neurodynamics to identify the constraints between the network parameters and the input signals. This solid analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF. Based on this analysis, we propose a Neurodynamics-Driven CNP Fusion model (ND-CNPFuse) tailored for the challenging MFIF task. Unlike current ideas of decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source image into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF. The code is available at https://github.com/MorvanLi/ND-CNPFuse.
中文: 该研究提出了一种神经动力学驱动的耦合神经P系统,通过分析神经约束并将源图像映射为可解释的脉冲矩阵,为多焦点图像融合生成精确的决策图,无需后处理即达到最先进性能。
English: The study introduces a neurodynamics-driven coupled neural P system that generates precise decision maps for multi-focus image fusion by analyzing neural constraints and mapping images into interpretable spike matrices, achieving state-of-the-art results without post-processing.

Authors:Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao
Title: EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
Abstract:
Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation -- they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experiment results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on the high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench.
中文: EngiBench是一个分层基准,旨在评估大语言模型在工程问题上的表现,涵盖三个难度级别和多个子领域,结果显示当前模型在高级推理和鲁棒性方面仍远不及人类专家。
English: EngiBench is a hierarchical benchmark introduced to evaluate large language models on engineering problems across three difficulty levels and multiple subfields, revealing that current models struggle with high-level reasoning and robustness compared to human experts.

Authors:Julia Matejas, Olaf Żurawski, Nils Strodthoff, Juan Miguel Lopez Alcaraz
Title: Predicting Chest Radiograph Findings from Electrocardiograms Using Interpretable Machine Learning
Abstract:
Purpose: Chest X-rays are essential for diagnosing pulmonary conditions, but limited access in resource-constrained settings can delay timely diagnosis. Electrocardiograms (ECGs), in contrast, are widely available, non-invasive, and often acquired earlier in clinical workflows. This study aims to assess whether ECG features and patient demographics can predict chest radiograph findings using an interpretable machine learning approach. Methods: Using the MIMIC-IV database, Extreme Gradient Boosting (XGBoost) classifiers were trained to predict diverse chest radiograph findings from ECG-derived features and demographic variables. Recursive feature elimination was performed independently for each target to identify the most predictive features. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC) with bootstrapped 95% confidence intervals. Shapley Additive Explanations (SHAP) were applied to interpret feature contributions. Results: Models successfully predicted multiple chest radiograph findings with varying accuracy. Feature selection tailored predictors to each target, and including demographic variables consistently improved performance. SHAP analysis revealed clinically meaningful contributions from ECG features to radiographic predictions. Conclusion: ECG-derived features combined with patient demographics can serve as a proxy for certain chest radiograph findings, enabling early triage or pre-screening in settings where radiographic imaging is limited. Interpretable machine learning demonstrates potential to support radiology workflows and improve patient care.
中文: 本研究通过可解释机器学习证明,心电图特征结合患者人口统计学数据能够预测胸部X光片结果,为影像资源有限的环境提供了一种潜在的早期筛查方案。
English: This study demonstrates that ECG features and patient demographics can predict chest X-ray findings using interpretable machine learning, offering a potential screening solution for resource-limited settings where radiographic access is constrained.

Authors:Mariette Schönfeld, Wannes Meert, Hendrik Blockeel
Title: Tailored Transformation Invariance for Industrial Anomaly Detection
Abstract:
Industrial Anomaly Detection (IAD) is a subproblem within Computer Vision Anomaly Detection that has been receiving increasing amounts of attention due to its applicability to real-life scenarios. Recent research has focused on how to extract the most informative features, contrasting older kNN-based methods that use only pretrained features. These recent methods are much more expensive to train however and could complicate real-life application. Careful study of related work with regards to transformation invariance leads to the idea that popular benchmarks require robustness to only minor translations. With this idea we then formulate LWinNN, a local window based approach that creates a middle ground between kNN based methods that have either complete or no translation invariance. Our experiments demonstrate that this small change increases accuracy considerably, while simultaneously decreasing both train and test time. This teaches us two things: first, the gap between kNN-based approaches and more complex state-of-the-art methodology can still be narrowed by effective usage of the limited data available. Second, our assumption of requiring only limited translation invariance highlights potential areas of interest for future work and the need for more spatially diverse benchmarks, for which our method can hopefully serve as a new baseline. Our code can be found at https://github.com/marietteschonfeld/LWinNN .
中文:提出的LWinNN方法通过引入有限平移不变性,在kNN基础方法和复杂方法之间找到平衡,显著提升检测精度并降低计算成本,同时揭示了当前基准测试需要更大空间多样性的问题。
English: The proposed LWinNN method bridges kNN-based and complex approaches by introducing limited translation invariance, significantly improving accuracy while reducing computational costs, and highlighting the need for more diverse benchmarks.

Authors:Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye
Title: SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
Abstract:
While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images' spatial representation ability. In this paper, we analyze the problem hindering VLMs' spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs through two key contributions: (1) propose Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) introduce a simple depth positional encoding method strengthening VLMs' spatial awareness. MSMU dataset covers massive quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.
中文: 本文提出SD-VLM框架,通过大规模空间测量理解数据集和深度位置编码方法,显著提升了视觉语言模型的三维空间感知能力,在多个空间理解基准测试中表现优异。
English: This paper introduces SD-VLM, a novel framework that enhances vision language models' 3D spatial reasoning through a comprehensive MSMU dataset and depth positional encoding, achieving state-of-the-art performance on spatial benchmarks.

Authors:Sehyun Kim, Hye Jun Lee, Jiwoo Lee, Taemin Lee
Title: Clothing agnostic Pre-inpainting Virtual Try-ON
Abstract:
With the development of deep learning technology, virtual try-on technology has become an important application value in the fields of e-commerce, fashion, and entertainment. The recently proposed Leffa has improved the texture distortion problem of diffu-sion-based models, but there are limitations in that the bottom detection inaccuracy and the existing clothing silhouette remain in the synthesis results. To solve this problem, this study proposes CaP-VTON (Clothing agnostic Pre-inpainting Virtual Try-ON). CaP-VTON has improved the naturalness and consistency of whole-body clothing syn-thesis by integrating multi-category masking based on Dress Code and skin inpainting based on Stable Diffusion. In particular, a generate skin module was introduced to solve the skin restoration problem that occurs when long-sleeved images are converted into short-sleeved or sleeveless ones, and high-quality restoration was implemented consider-ing the human body posture and color. As a result, CaP-VTON recorded 92.5%, which is 15.4% better than Leffa in short-sleeved synthesis accuracy, and showed the performance of consistently reproducing the style and shape of reference clothing in visual evaluation. These structures maintain model-agnostic properties and are applicable to various diffu-sion-based virtual inspection systems, and can contribute to applications that require high-precision virtual wearing, such as e-commerce, custom styling, and avatar creation.
中文: 本研究提出的CaP-VTON虚拟试穿系统通过整合多类别遮罩和皮肤修复技术,显著提升了全身服装合成的自然度与一致性,在精度和视觉还原度上均优于现有方法。
English: This study introduces CaP-VTON, a virtual try-on system that enhances full-body clothing synthesis by integrating multi-category masking and skin inpainting, achieving significant improvements in accuracy and visual consistency over previous methods.

Authors:Yuzhen Lei, Hongbin Xie, Jiaxing Zhao, Shuangxue Liu, Xuan Song
Title: MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents
Abstract:
Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models' abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose \textbf{MSCoRe}, a novel benchmark comprising 126696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models' robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0-source/MSCoRE.
中文: 大语言模型在单领域问答任务中表现出色,但在复杂多阶段推理和协作能力方面研究不足,为此提出了MSCoRe基准,旨在评估和提升模型在跨领域场景中的多级推理与优化性能。
English: Large Language Models excel in single-domain QA tasks but lack exploration in complex multi-stage reasoning, prompting the creation of the MSCoRe benchmark to evaluate and enhance their collaborative and optimization abilities across diverse sectors.

Authors:Aiming Zhang, Tianyuan Yu, Liang Bai, Jun Tang, Yanming Guo, Yirun Ruan, Yun Zhou, Zhihe Lu
Title: COLA: Context-aware Language-driven Test-time Adaptation
Abstract:
Test-time adaptation (TTA) has gained increasing popularity due to its efficacy in addressing ``distribution shift'' issue while simultaneously protecting data privacy. However, most prior methods assume that a paired source domain model and target domain sharing the same label space coexist, heavily limiting their applicability. In this paper, we investigate a more general source model capable of adaptation to multiple target domains without needing shared labels. This is achieved by using a pre-trained vision-language model (VLM), \egno, CLIP, that can recognize images through matching with class descriptions. While the zero-shot performance of VLMs is impressive, they struggle to effectively capture the distinctive attributes of a target domain. To that end, we propose a novel method -- Context-aware Language-driven TTA (COLA). The proposed method incorporates a lightweight context-aware module that consists of three key components: a task-aware adapter, a context-aware unit, and a residual connection unit for exploring task-specific knowledge, domain-specific knowledge from the VLM and prior knowledge of the VLM, respectively. It is worth noting that the context-aware module can be seamlessly integrated into a frozen VLM, ensuring both minimal effort and parameter efficiency. Additionally, we introduce a Class-Balanced Pseudo-labeling (CBPL) strategy to mitigate the adverse effects caused by class imbalance. We demonstrate the effectiveness of our method not only in TTA scenarios but also in class generalisation tasks. The source code is available at https://github.com/NUDT-Bai-Group/COLA-TTA.
中文:测试时适应(TTA)方法虽能应对分布偏移并保护数据隐私,但通常依赖共享标签空间;本文提出的COLA方法利用视觉语言模型,通过轻量级上下文感知模块和类平衡伪标签策略,无需共享标签即可适应多个目标域,提升了TTA和类泛化任务的性能。
English: Test-time adaptation (TTA) addresses distribution shifts and data privacy but often requires shared label spaces, whereas the proposed COLA method uses a vision-language model with a lightweight context-aware module and class-balanced pseudo-labeling to adapt to multiple target domains without shared labels, enhancing both TTA and class generalization tasks.

Authors:Florinel Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu
Title: PRNU-Bench: A Novel Benchmark and Model for PRNU-Based Camera Identification
Abstract:
We propose a novel benchmark for camera identification via Photo Response Non-Uniformity (PRNU) estimation. The benchmark comprises 13K photos taken with 120+ cameras, where the training and test photos are taken in different scenarios, enabling ``in-the-wild'' evaluation. In addition, we propose a novel PRNU-based camera identification model that employs a hybrid architecture, comprising a denoising autoencoder to estimate the PRNU signal and a convolutional network that can perform 1:N verification of camera devices. Instead of using a conventional approach based on contrastive learning, our method takes the Hadamard product between reference and query PRNU signals as input. This novel design leads to significantly better results compared with state-of-the-art models based on denoising autoencoders and contrastive learning. We release our dataset and code at: https://github.com/CroitoruAlin/PRNU-Bench.
Chinese: 我们提出了一个基于光响应非均匀性估计的相机识别新基准,包含来自120多台相机的1.3万张照片,并采用结合去噪自编码器和卷积网络的混合模型,显著提升了识别性能。
English: We introduce a new benchmark for camera identification using PRNU estimation, featuring 13,000 photos from over 120 cameras and a hybrid model that combines a denoising autoencoder with a convolutional network for improved accuracy.

Authors:Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
Title: Visual Instruction Pretraining for Domain-Specific Foundation Models
Abstract:
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is not yet underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
中文摘要:本文提出视觉指令预训练(ViTP)新范式,通过将推理融入感知来增强基础模型预训练,在遥感和医学影像等多个挑战性基准测试中实现了最先进的性能。
English Summary: This paper introduces Visual Instruction Pretraining (ViTP), a novel paradigm that integrates reasoning into perception to enhance foundation model pretraining, achieving state-of-the-art results across multiple challenging benchmarks in remote sensing and medical imaging.

Authors:Dian Jin, Yanghao Zhou, Jinxing Zhou, Jiaqi Ma, Ruohao Guo, Dan Guo
Title: SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
Abstract:
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objectsacross video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions but referring to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods.
中文:SimToken框架通过整合多模态大语言模型与Segment Anything Model,利用生成的特殊语义令牌指导视频对象分割,并借助目标一致性语义对齐损失提升性能,在Ref-AVS基准测试中表现优异。
English: The proposed SimToken framework integrates a multimodal large language model with the Segment Anything Model to segment video objects based on audio-visual-text references, achieving superior performance through a novel semantic alignment loss.

Authors:Wenhao Zhuang, Yuan Sun, Xiaobing Zhao
Title: Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages
Abstract:
As large language models (LLMs) are trained on increasingly diverse and extensive multilingual corpora, they demonstrate cross-lingual transfer capabilities. However, these capabilities often fail to effectively extend to low-resource languages, particularly those utilizing non-Latin scripts. While transliterating low-resource languages into Latin script presents a natural solution, there currently lacks a comprehensive framework for integrating transliteration into LLMs training and deployment. Taking a pragmatic approach, this paper innovatively combines character transliteration with Huffman coding to design a complete transliteration framework. Our proposed framework offers the following advantages: 1) Compression: Reduces storage requirements for low-resource language content, achieving up to 50% reduction in file size and 50-80% reduction in token count. 2) Accuracy: Guarantees 100% lossless conversion from transliterated text back to the source language. 3) Efficiency: Eliminates the need for vocabulary expansion for low-resource languages, improving training and inference efficiency. 4) Scalability: The framework can be extended to other low-resource languages. We validate the effectiveness of our framework across multiple downstream tasks, including text classification, machine reading comprehension, and machine translation. Experimental results demonstrate that our method significantly enhances the model's capability to process low-resource languages while maintaining performance on high-resource languages. Our data and code are publicly available at https://github.com/CMLI-NLP/HuffmanTranslit.
中文摘要:本文创新性地结合字符音译与霍夫曼编码,提出一个完整的音译框架,有效提升大语言模型对低资源语言的处理能力,在保持高资源语言性能的同时实现显著压缩、无损转换和效率优化。
English Summary: This paper introduces a novel transliteration framework combining character transliteration with Huffman coding to enhance LLMs' processing of low-resource languages, achieving significant compression, lossless accuracy, and improved efficiency without vocabulary expansion.

Authors:Qinghua Lin, Guang-Hai Liu, Zuoyong Li, Yang Li, Yuting Jiang, Xiang Wu
Title: Multimodal Medical Image Classification via Synergistic Learning Pre-training
Abstract:
Multimodal pathological images are usually in clinical diagnosis, but computer vision-based multimodal image-assisted diagnosis faces challenges with modality fusion, especially in the absence of expert-annotated data. To achieve the modality fusion in multimodal images with label scarcity, we propose a novel ``pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification. Specifically, we propose a synergistic learning pretraining framework of consistency, reconstructive, and aligned learning. By treating one modality as an augmented sample of another modality, we implement a self-supervised learning pre-train, enhancing the baseline model's feature representation capability. Then, we design a fine-tuning method for multimodal fusion. During the fine-tuning stage, we set different encoders to extract features from the original modalities and provide a multimodal fusion encoder for fusion modality. In addition, we propose a distribution shift method for multimodal fusion features, which alleviates the prediction uncertainty and overfitting risks caused by the lack of labeled samples. We conduct extensive experiments on the publicly available gastroscopy image datasets Kvasir and Kvasirv2. Quantitative and qualitative results demonstrate that the proposed method outperforms the current state-of-the-art classification methods. The code will be released at: https://github.com/LQH89757/MICS.
Chinese: 本研究提出了一种新颖的“预训练+微调”框架,通过协同学习增强特征表示,解决了多模态图像融合的难题,并在胃镜数据集上实现了最先进的分类性能。
English: This study introduces a novel "pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification, which enhances feature representation through synergistic learning and addresses modality fusion challenges, achieving state-of-the-art performance on gastroscopy datasets.

Authors:Xingqi Wang, Yiming Cui, Xin Yao, Shijin Wang, Guoping Hu, Xiaoyu Qin
Title: ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding
Abstract:
Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies. Code and data are available at https://github.com/ymcui/ChartHal .
中文摘要:大型视觉语言模型在图表理解中存在严重幻觉问题,ChartHal基准测试显示即使GPT-5和o4-mini等先进模型准确率也极低,凸显了改进缓解策略的迫切需求。
English Summary: Large Vision-Language Models exhibit severe hallucination issues in chart understanding, as demonstrated by the ChartHal benchmark where even advanced models like GPT-5 and o4-mini show low accuracy, highlighting the need for better mitigation strategies.

Authors:Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He
Title: MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion
Abstract:
Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05\% and +4.18\% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus.
Chinese: 提出的MVCL-DAF++模型通过原型感知对比对齐和粗细粒度注意力融合模块,有效解决了多模态意图识别中的语义基础薄弱和噪声鲁棒性问题,在基准数据集上实现了最优性能并显著提升了稀有类别的识别准确率。
English: The proposed MVCL-DAF++ model addresses multimodal intent recognition challenges by introducing prototype-aware contrastive alignment and coarse-to-fine attention fusion, achieving state-of-the-art performance with significant improvements in rare-class recognition on benchmark datasets.

Authors:Tong Chen, Zimu Wang, Yiyi Miao, Haoran Luo, Yuanfei Sun, Wei Wang, Zhengyong Jiang, Procheta Sen, Jionglong Su
Title: MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses
Abstract:
Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at https://github.com/AshleyChenNLP/MedFact.
中文: MedFact是首个基于证据的中文医学事实核查数据集,专门针对大语言模型生成的医学内容,包含1,321个问题和7,409条声明,用于评估大语言模型在真实医疗场景中的能力与挑战。
English: MedFact is the first evidence-based Chinese dataset for fact-checking medical content generated by large language models (LLMs), comprising 1,321 questions and 7,409 claims to evaluate LLMs' capabilities and challenges in real-world medical scenarios.

Authors:Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
Title: QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
Abstract:
The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
中文摘要:QWHA是一种新颖方法,通过采用沃尔什-哈达玛变换集成傅里叶相关变换适配器,有效降低大语言模型的量化误差和计算成本,在低比特量化中实现了更优的准确性和训练效率。
English Summary: QWHA is a novel method that integrates Fourier-related transform adapters using the Walsh-Hadamard Transform to effectively reduce quantization errors and computational costs in large language models, achieving superior accuracy and training efficiency in low-bit quantization.

Authors:Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim
Title: QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
Abstract:
The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.
中文摘要:QWHA是一种新颖方法,通过采用沃尔什-哈达玛变换集成傅里叶相关变换适配器,有效降低大语言模型的量化误差和计算成本,在低比特量化中实现了更优的准确性和训练效率。
English Summary: QWHA is a novel method that integrates Fourier-related transform adapters using the Walsh-Hadamard Transform to effectively reduce quantization errors and computational costs in large language models, achieving superior accuracy and training efficiency in low-bit quantization.

Authors:Pramit Sahoo, Maharaj Brahma, Maunendra Sankar Desarkar
Title: DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context
Abstract:
Large language models (LLMs) are widely used in various tasks and applications. However, despite their wide capabilities, they are shown to lack cultural alignment \citep{ryan-etal-2024-unintended, alkhamissi-etal-2024-investigating} and produce biased generations \cite{naous-etal-2024-beer} due to a lack of cultural knowledge and competence. Evaluation of LLMs for cultural awareness and alignment is particularly challenging due to the lack of proper evaluation metrics and unavailability of culturally grounded datasets representing the vast complexity of cultures at the regional and sub-regional levels. Existing datasets for culture specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, belonging to 17 cultural facets. The dataset comprises $\sim$8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs created, LLM as Judge, and human evaluations from diverse socio-demographic region. Furthermore, we perform quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available here: \href{https://huggingface.co/datasets/nlip/DIWALI}{https://huggingface.co/datasets/nlip/DIWALI}, project webpage\footnote{\href{https://nlip-lab.github.io/nlip/publications/diwali/}{https://nlip-lab.github.io/nlip/publications/diwali/}}, and our codebase with model outputs can be found here: \href{https://github.com/pramitsahoo/culture-evaluation}{https://github.com/pramitsahoo/culture-evaluation}.
中文: 大型语言模型常因文化知识不足而缺乏文化对齐并产生偏见输出,为此我们构建了印度文化数据集,通过多维度评估来检验其文化适应能力。
English: Large language models often lack cultural alignment and produce biased outputs due to insufficient cultural knowledge, prompting the creation of a new Indian cultural dataset to evaluate their competence through multi-faceted assessments.

Authors:Pramit Sahoo, Maharaj Brahma, Maunendra Sankar Desarkar
Title: DIWALI -- Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context
Abstract:
Large language models (LLMs) are widely used in various tasks and applications. However, despite their wide capabilities, they are shown to lack cultural alignment \citep{ryan-etal-2024-unintended, alkhamissi-etal-2024-investigating} and produce biased generations \cite{naous-etal-2024-beer} due to a lack of cultural knowledge and competence. Evaluation of LLMs for cultural awareness and alignment is particularly challenging due to the lack of proper evaluation metrics and unavailability of culturally grounded datasets representing the vast complexity of cultures at the regional and sub-regional levels. Existing datasets for culture specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, belonging to 17 cultural facets. The dataset comprises $\sim$8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs created, LLM as Judge, and human evaluations from diverse socio-demographic region. Furthermore, we perform quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available here: https://huggingface.co/datasets/nlip/DIWALI, project webpage https://nlip-lab.github.io/nlip/publications/diwali/, and our codebase with model outputs can be found here: https://github.com/pramitsahoo/culture-evaluation
中文: 大型语言模型常因文化知识不足而缺乏文化对齐并产生偏见输出,为此我们构建了印度文化数据集,通过多维度评估来检验其文化适应能力。
English: Large language models often lack cultural alignment and produce biased outputs due to insufficient cultural knowledge, prompting the creation of a new Indian cultural dataset to evaluate their competence through multi-faceted assessments.

Authors:Kang-il Lee, Jahyun Koo, Seunghyun Yoon, Minbeom Kim, Hyukhun Koh, Dongryeol Lee, Kyomin Jung
Title: Program Synthesis via Test-Time Transduction
Abstract:
We introduce transductive program synthesis, a new formulation of the program synthesis task that explicitly leverages test inputs during synthesis. While prior approaches to program synthesis--whether based on natural language descriptions or input-output examples--typically aim to generalize from training examples, they often struggle with robustness, especially in real-world settings where training examples are limited and test inputs involve various edge cases. To address this, we propose a novel framework that improves robustness by treating synthesis as an active learning over a finite hypothesis class defined by programs' outputs. We use an LLM to predict outputs for selected test inputs and eliminate inconsistent hypotheses, where the inputs are chosen via a greedy maximin algorithm to minimize the number of LLM queries required. We evaluate our approach on four benchmarks: Playgol, MBPP+, 1D-ARC, and programmatic world modeling on MiniGrid. We demonstrate that our method significantly improves program synthesis in both accuracy and efficiency. We release our code at https://github.com/klee972/SYNTRA.
中文摘要:本文提出转导式程序综合方法,通过主动选择测试输入并利用大语言模型优化程序假设,显著提升了多个基准测试的准确性和效率。
English Summary: This paper introduces transductive program synthesis, a framework that enhances robustness by actively selecting test inputs to refine program hypotheses using an LLM, significantly improving accuracy and efficiency across multiple benchmarks.

Authors:Junzhe Wu, Yufei Jia, Yiyi Yan, Zhixing Chen, Tiao Tan, Zifan Wang, Guangyu Wang
Title: FGGS-LiDAR: Ultra-Fast, GPU-Accelerated Simulation from General 3DGS Models to LiDAR
Abstract:
While 3D Gaussian Splatting (3DGS) has revolutionized photorealistic rendering, its vast ecosystem of assets remains incompatible with high-performance LiDAR simulation, a critical tool for robotics and autonomous driving. We present \textbf{FGGS-LiDAR}, a framework that bridges this gap with a truly plug-and-play approach. Our method converts \textit{any} pretrained 3DGS model into a high-fidelity, watertight mesh without requiring LiDAR-specific supervision or architectural alterations. This conversion is achieved through a general pipeline of volumetric discretization and Truncated Signed Distance Field (TSDF) extraction. We pair this with a highly optimized, GPU-accelerated ray-casting module that simulates LiDAR returns at over 500 FPS. We validate our approach on indoor and outdoor scenes, demonstrating exceptional geometric fidelity; By enabling the direct reuse of 3DGS assets for geometrically accurate depth sensing, our framework extends their utility beyond visualization and unlocks new capabilities for scalable, multimodal simulation. Our open-source implementation is available at https://github.com/TATP-233/FGGS-LiDAR.
中文: FGGS-LiDAR 可将任何预训练的3D高斯溅射模型转换为高保真网格,实现实时激光雷达模拟,无需特定训练即可支持可扩展的多模态应用。
English: FGGS-LiDAR converts any pretrained 3D Gaussian Splatting model into a high-fidelity mesh for real-time LiDAR simulation, enabling scalable multimodal applications without requiring LiDAR-specific training.

Authors:Ziqing Zou, Cong Wang, Yue Hu, Xiao Liu, Bowen Xu, Rong Xiong, Changjie Fan, Yingfeng Chen, Yue Wang
Title: High-Precision and High-Efficiency Trajectory Tracking for Excavators Based on Closed-Loop Dynamics
Abstract:
The complex nonlinear dynamics of hydraulic excavators, such as time delays and control coupling, pose significant challenges to achieving high-precision trajectory tracking. Traditional control methods often fall short in such applications due to their inability to effectively handle these nonlinearities, while commonly used learning-based methods require extensive interactions with the environment, leading to inefficiency. To address these issues, we introduce EfficientTrack, a trajectory tracking method that integrates model-based learning to manage nonlinear dynamics and leverages closed-loop dynamics to improve learning efficiency, ultimately minimizing tracking errors. We validate our method through comprehensive experiments both in simulation and on a real-world excavator. Comparative experiments in simulation demonstrate that our method outperforms existing learning-based approaches, achieving the highest tracking precision and smoothness with the fewest interactions. Real-world experiments further show that our method remains effective under load conditions and possesses the ability for continual learning, highlighting its practical applicability. For implementation details and source code, please refer to https://github.com/ZiqingZou/EfficientTrack.
中文: EfficientTrack是一种创新的轨迹跟踪方法,它结合基于模型的学习有效处理液压挖掘机的非线性动力学,在仿真和实际应用中均以最少的环境交互实现了最优的跟踪精度和平稳性。
English: EfficientTrack is a novel trajectory tracking method that integrates model-based learning to effectively handle the nonlinear dynamics of hydraulic excavators, achieving superior precision and smoothness with minimal environmental interactions in both simulations and real-world applications.

Authors:Zhizhang FU, Guangsheng Bao, Hongbo Zhang, Chenkai Hu, Yue Zhang
Title: Correlation or Causation: Analyzing the Causal Structures of LLM and LRM Reasoning Process
Abstract:
LLMs suffer from critical reasoning issues such as unfaithfulness, bias, and inconsistency, since they lack robust causal underpinnings and may rely on superficial correlations rather than genuine understanding. Successive LRMs have emerged as a promising alternative, leveraging advanced training techniques such as reinforcement learning (RL) and distillation to improve task accuracy. However, the impact of these training methods on causality remains largely unexplored. In this study, we conduct a systematic causal analysis on LLMs and LRMs, examining structural causal models (SCMs) of four key variables: problem instruction (Z), thinking process (T), reasoning steps (X), and answer (Y). Our findings reveal that RLVR-trained LRMs exhibit enhanced causal reasoning capabilities, aligning more closely with ideal causal structures, while LLMs and distilled LRMs fail to address causality-related deficiencies. Our further investigation indicates that RLVR reduces spurious correlations and strengthens genuine causal patterns, thereby mitigating unfaithfulness and bias. In addition, our inspection on the dynamics of the RLVR training process observes a high correlation between reduced spurious features and improved causal structures, where the causal relationships consistently improve in the training process. This study contributes to the understanding of causality in reasoning models, highlights the critical role of RLVR in enhancing causal reasoning, and provides insights for designing future AI systems with stronger causal foundations. We release our code and data at https://github.com/Harryking1999/CoT_Causal_Analysis.
中文:大语言模型因缺乏稳健的因果基础而存在推理缺陷,而采用强化学习价值正则化训练的连续推理模型通过消除伪相关、强化真实因果模式,展现出更优的因果推理能力。
English: Large language models (LLMs) exhibit reasoning flaws due to weak causality, while reinforcement learning with value regularization (RLVR)-trained language reasoning models (LRMs) demonstrate enhanced causal reasoning by reducing spurious correlations and strengthening genuine causal patterns.

Authors:Minglai Yang, Reyan Ahmed
Title: Word2VecGD: Neural Graph Drawing with Cosine-Stress Optimization
Abstract:
We propose a novel graph visualization method leveraging random walk-based embeddings to replace costly graph-theoretical distance computations. Using word2vec-inspired embeddings, our approach captures both structural and semantic relationships efficiently. Instead of relying on exact shortest-path distances, we optimize layouts using cosine dissimilarities, significantly reducing computational overhead. Our framework integrates differentiable stress optimization with stochastic gradient descent (SGD), supporting multi-criteria layout objectives. Experimental results demonstrate that our method produces high-quality, semantically meaningful layouts while efficiently scaling to large graphs. Code available at: https://github.com/mlyann/graphv_nn
中文摘要:本文提出了一种新颖的图可视化方法,利用基于随机游走的嵌入和余弦相异度替代高成本的距离计算,通过SGD优化实现了高质量布局并显著降低了计算开销。
English Summary: This paper introduces an efficient graph visualization method that uses random walk-based embeddings and cosine dissimilarities to replace expensive distance computations, achieving high-quality layouts with reduced computational cost through SGD optimization.

Authors:Weihua Du, Hailei Gong, Zhan Ling, Kang Liu, Lingfeng Shen, Xuesong Yao, Yufei Xu, Dingyuan Shi, Yiming Yang, Jiecao Chen
Title: Generalizable End-to-End Tool-Use RL with Synthetic CodeGym
Abstract:
Tool-augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine-tuning (SFT) over static trajectories or reinforcement learning (RL) on narrow tasks, and generalize poorly beyond development settings, leading to brittleness with new tools and unseen workflows. Because code execution reflects many structures of real-world workflows, coding problems provide a natural basis for building agent training environments. Motivated by this, we introduce CodeGym, a scalable framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL, enabling LLM agents to explore and master various workflows actively. CodeGym rewrites static coding problems into interactive environments by extracting atomic functions or logic into callable tools, yielding verifiable tasks that span various tool-execution workflows. Models of varying sizes and chain-of-thought configurations, trained in CodeGym, exhibit consistent out-of-distribution generalizability; for example, Qwen2.5-32B-Instruct achieves an absolute accuracy gain of 8.7 points on the OOD benchmark $τ$-Bench. These results highlight CodeGym as a step toward scalable general-purpose RL environments that align with real-world agent workflows.
中文摘要:CodeGym是一个可扩展框架,通过将静态编程问题转化为交互式环境来训练LLM智能体,使其在分布外任务上展现出显著提升的泛化能力。
English Summary: CodeGym is a scalable framework that transforms static coding problems into interactive environments for training LLM agents through reinforcement learning, significantly enhancing their generalization capabilities on out-of-distribution tasks.

Authors:Buyin Deng, Lingxin Huang, Kai Luo, Fei Teng, Kailun Yang
Title: DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking
Abstract:
Visual Multi-Object Tracking (MOT) is a crucial component of robotic perception, yet existing Tracking-By-Detection (TBD) methods often rely on 2D cues, such as bounding boxes and motion modeling, which struggle under occlusions and close-proximity interactions. Trackers relying on these 2D cues are particularly unreliable in robotic environments, where dense targets and frequent occlusions are common. While depth information has the potential to alleviate these issues, most existing MOT datasets lack depth annotations, leading to its underexploited role in the domain. To unveil the potential of depth-informed trajectory refinement, we introduce DepTR-MOT, a DETR-based detector enhanced with instance-level depth information. Specifically, we propose two key innovations: (i) foundation model-based instance-level soft depth label supervision, which refines depth prediction, and (ii) the distillation of dense depth maps to maintain global depth consistency. These strategies enable DepTR-MOT to output instance-level depth during inference, without requiring foundation models and without additional computational cost. By incorporating depth cues, our method enhances the robustness of the TBD paradigm, effectively resolving occlusion and close-proximity challenges. Experiments on both the QuadTrack and DanceTrack datasets demonstrate the effectiveness of our approach, achieving HOTA scores of 27.59 and 44.47, respectively. In particular, results on QuadTrack, a robotic platform MOT dataset, highlight the advantages of our method in handling occlusion and close-proximity challenges in robotic tracking. The source code will be made publicly available at https://github.com/warriordby/DepTR-MOT.
中文: DepTR-MOT提出了一种基于DETR的多目标跟踪器,通过实例级深度信息增强,在不增加计算成本的情况下有效解决了机器人环境中遮挡和近距离交互的跟踪难题。
English: DepTR-MOT introduces a DETR-based multi-object tracker enhanced with instance-level depth information, effectively addressing occlusion and proximity challenges in robotic environments by incorporating depth cues without additional computational costs.

Authors:Zhuofan Chen, Jiyuan He, Yichi Zhang, Xing Hu, Haoxing Wen, Jun Bai, Wenge Rong
Title: CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models
Abstract:
Mathematical reasoning poses significant challenges for Large Language Models (LLMs) due to its demand for multi-step reasoning and abstract conceptual integration. While recent test-time scaling techniques rely heavily on high-quality, challenging problems, the scarcity of Olympiad-level math problems remains a bottleneck. We introduce CogAtom, a novel cognitive atom-based framework for synthesizing mathematically rigorous and cognitively diverse problems. Unlike prior approaches, CogAtom models problem construction as a process of selecting and recombining fundamental reasoning units, cognitive atoms, extracted from human-authored solutions. A diversity-promoting random walk algorithm enables exploration of the cognitive atom space, while a constraint-based recombination mechanism ensures logical soundness and structural validity. The combinatorial nature of the graph structure provides a near-infinite space of reasoning paths, and the walk algorithm systematically explores this space to achieve large-scale synthesis of high-quality problems; meanwhile, by controlling the number of cognitive atoms, we can precisely adjust problem difficulty, ensuring diversity, scalability, and controllability of the generated problems. Experimental results demonstrate that CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match the difficulty of AIME while exceeding it in structural variation. Our work offers a cognitively grounded pathway toward scalable, high-quality math problem generation.Our code is publicly available at https://github.com/Icarus-1111/CogAtom.
中文:CogAtom提出了一种基于认知原子的框架,通过重组基本推理单元来合成数学严谨且多样化的问题,实现了可扩展、高质量且难度可控的数学题目生成。
English: CogAtom introduces a cognitive atom-based framework that synthesizes mathematically rigorous and diverse problems by recombining fundamental reasoning units, enabling scalable, high-quality math problem generation with precise difficulty control.

Authors:Sydney Anuyah, Mehedi Mahmud Kaushik, Krishna Dwarampudi, Rakesh Shiradkar, Arjan Durresi, Sunandan Chakraborty
Title: Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling
Abstract:
We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our model, we contribute a dataset of over 150,000 knowledge triples, which is open source. We also contribute a training corpus of 7248 rows for sentence complexity, 190 rows of gold human annotations for co-reference resolution using open source lung-cancer abstracts from PubMed, 900 rows of gold human annotations for sentence conversion policies, and 398 triples of gold human annotations. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies demonstrate that integrating coreference and decomposition increases recall on rare relations by over 20%. Code and dataset are available at https://github.com/KaushikMahmud/CoDe-KG_EMNLP_2025
中文摘要:CoDe-KG是一个结合指代消解与句法分解的开源知识图谱抽取系统,在关系抽取任务上达到最优性能,并显著提升了对罕见关系的召回率。
English Summary: CoDe-KG is an open-source pipeline that combines coreference resolution with syntactic decomposition to extract sentence-level knowledge graphs, achieving state-of-the-art performance on relation extraction and significantly improving recall for rare relations.

Authors:Mandip Goswami
Title: BeepBank-500: A Synthetic Earcon Mini-Corpus for UI Sound Research and Psychoacoustics Research
Abstract:
We introduce BeepBank-500, a compact, fully synthetic earcon/alert dataset (300-500 clips) designed for rapid, rights-clean experimentation in human-computer interaction and audio machine learning. Each clip is generated from a parametric recipe controlling waveform family (sine, square, triangle, FM), fundamental frequency, duration, amplitude envelope, amplitude modulation (AM), and lightweight Schroeder-style reverberation. We use three reverberation settings: dry, and two synthetic rooms denoted 'rir small' ('small') and 'rir medium' ('medium') throughout the paper and in the metadata. We release mono 48 kHz WAV audio (16-bit), a rich metadata table (signal/spectral features), and tiny reproducible baselines for (i) waveform-family classification and (ii) f0 regression on single tones. The corpus targets tasks such as earcon classification, timbre analyses, and onset detection, with clearly stated licensing and limitations. Audio is dedicated to the public domain via CC0-1.0; code is under MIT. Data DOI: https://doi.org/10.5281/zenodo.17172015. Code: https://github.com/mandip42/earcons-mini-500.
中文: BeepBank-500是一个包含300-500个样本的紧凑型合成提示音数据集,专为人机交互和音频机器学习设计,采用参数化声音生成并遵循公共领域许可协议。
English: BeepBank-500 is a compact synthetic earcon dataset with 300-500 clips, designed for rights-clean experimentation in HCI and audio ML, featuring parametric sound generation and public domain licensing.

Authors:Yassine Kebbati, Naima Ait-Oufroukh, Vincent Vigneron, Dalil Ichala
Title: Neural Network and ANFIS based auto-adaptive MPC for path tracking in autonomous vehicles
Abstract:
Self-driving cars operate in constantly changing environments and are exposed to a variety of uncertainties and disturbances. These factors render classical controllers ineffective, especially for lateral control. Therefore, an adaptive MPC controller is designed in this paper for the path tracking task, tuned by an improved particle swarm optimization algorithm. Online parameter adaptation is performed using Neural Networks and ANFIS. The designed controller showed promising results compared to standard MPC in triple lane change and trajectory tracking scenarios. Code can be found here: https://github.com/yassinekebbati/NN_MPC-vs-ANFIS_MPC
中文: 本文设计了一种基于神经网络和ANFIS优化的自适应MPC控制器,用于自动驾驶汽车的路径跟踪,在复杂场景中相比标准MPC展现出更优性能。
English: This paper develops an adaptive MPC controller enhanced by neural networks and ANFIS for autonomous vehicle path tracking, demonstrating superior performance over standard MPC in complex driving scenarios.

Authors:Jinchao Ge, Tengfei Cheng, Biao Wu, Zeyu Zhang, Shiya Huang, Judith Bishop, Gillian Shepherd, Meng Fang, Ling Chen, Yang Zhao
Title: VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
Abstract:
Analyzing cultural-heritage artifacts remains challenging for MLLMs: general models lack domain expertise, and SFT often overfits superficial patterns, yielding brittle reasoning for authentication and historical attribution. This raises the question of how to equip MLLMs with robust, expert-level reasoning for ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns evaluation into supervision: we construct a taxonomy of question types, probe the SFT model to localize type-specific performance gaps, and optimize with type-conditioned, compositionality-oriented rewards targeting those gaps. We also release VaseVQA, a comprehensive benchmark of 31,773 images designed to probe deep understanding. Experiments show state-of-the-art results on style classification and historical attribution with marked gains in compositional robustness over SFT-only baselines, validating diagnosis-guided, taxonomy-conditioned reward engineering and providing a reusable resource for future research. Code and dataset will be available at https://github.com/AIGeeksGroup/VaseVQA.
中文总结:VaseVL系统通过诊断引导的强化学习提升多模态大模型对文物的专业推理能力,在风格分类和历史归属任务中实现最优性能,并发布了可复用的评测数据集。
English Summary: VaseVL is a system that enhances MLLMs' reasoning for cultural artifacts through targeted reinforcement learning, achieving state-of-the-art performance in authentication tasks while introducing a comprehensive benchmark dataset.

Authors:Kabir Hamzah Muhammad, Marawan Elbatel, Yi Qin, Xiaomeng Li
Title: Echo-Path: Pathology-Conditioned Echo Video Generation
Abstract:
Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, and echocardiography is critical for diagnosis of both common and congenital cardiac conditions. However, echocardiographic data for certain pathologies are scarce, hindering the development of robust automated diagnosis models. In this work, we propose Echo-Path, a novel generative framework to produce echocardiogram videos conditioned on specific cardiac pathologies. Echo-Path can synthesize realistic ultrasound video sequences that exhibit targeted abnormalities, focusing here on atrial septal defect (ASD) and pulmonary arterial hypertension (PAH). Our approach introduces a pathology-conditioning mechanism into a state-of-the-art echo video generator, allowing the model to learn and control disease-specific structural and motion patterns in the heart. Quantitative evaluation demonstrates that the synthetic videos achieve low distribution distances, indicating high visual fidelity. Clinically, the generated echoes exhibit plausible pathology markers. Furthermore, classifiers trained on our synthetic data generalize well to real data and, when used to augment real training sets, it improves downstream diagnosis of ASD and PAH by 7\% and 8\% respectively. Code, weights and dataset are available here https://github.com/Marshall-mk/EchoPathv1
中文:Echo-Path框架通过生成特定心脏病变的逼真超声心动图视频来解决数据稀缺问题,利用合成数据增强使ASD和PAH的自动诊断准确率分别提升7%和8%。
English: The proposed Echo-Path framework generates realistic echocardiogram videos with targeted cardiac pathologies to address data scarcity, improving automated diagnosis of conditions like ASD and PAH by 7-8% through synthetic data augmentation.

Authors:Yuhao Tian, Zheming Yang
Title: SAEC: Scene-Aware Enhanced Edge-Cloud Collaborative Industrial Vision Inspection with Multimodal LLM
Abstract:
Industrial vision inspection requires high accuracy under stringent resource constraints, yet existing approaches face a fundamental trade-off. Multimodal LLMs (MLLMs) deliver strong reasoning capabilities but incur prohibitive computational costs, while lightweight edge models often fail on complex cases. In this paper, we present SAEC, a scene-aware enhanced edge-cloud collaborative industrial vision inspection framework with MLLM. The framework is composed of three synergistic components: (1) Efficient MLLM Fine-Tuning for Complex Defect Inspection, (2) Lightweight Multiscale Scene-Complexity Estimation, and (3) Adaptive Edge-Cloud Scheduler. Together, these modules enable robust defect detection by tailoring multimodal reasoning to scene complexity and dynamically balancing computation between edge and cloud resources. Experimental results on MVTec AD and KSDD2 datasets demonstrate that SAEC attains 85.11% and 82.72% accuracy, surpassing Qwen by 22.1% and 20.8%, and LLaVA by 33.3% and 31.6%. It also reduces runtime by up to 22.4% and cuts energy per correct decision by 40%-74%. The code is available at https://github.com/YuHao-Tian/SAEC.
中文:SAEC是一种创新的边云协同框架,通过基于场景复杂度动态分配任务,显著提升了工业视觉检测的准确性和效率,超越了现有模型。
English: SAEC is a novel edge-cloud collaborative framework that enhances industrial vision inspection by dynamically allocating tasks based on scene complexity, achieving higher accuracy and efficiency than existing models.

Authors:Hang Xu, Zang Yu, Yehui Tang, Pengbo Hu, Yuhao Tang, Hao Dong
Title: MCTS-EP: Empowering Embodied Planning with Online Preference Optimization
Abstract:
This paper introduces MCTS-EP, an online learning framework that combines large language models (LLM) with Monte Carlo Tree Search (MCTS) for training embodied agents. MCTS-EP integrates three key components: MCTS-guided exploration for preference data collection, efficient multi-modal reasoning mechanism, and iterative training pipeline based on preference optimization. We theoretically prove that MCTS-EP achieves better performance bounds than conventional on-policy algorithms when the loss function is strongly convex, and demonstrate that it can be formulated as a search-enhanced variant of GAIL. MCTS-EP achieves state-of-the-art performace across serval benchmarks. In ALFWorld, it achieves 92% and 87% success rates for textual and visual tasks. In WebShop, it reaches an average reward of 0.81. MTCS-EP also reduces average interaction steps from from 18.7/19.5 to 10.2/9.9 steps in visual ALFWorld.Code available at: https://github.com/xuhang-2/Embodied-Agent-Planning
中文: 本文提出MCTS-EP框架,通过结合大语言模型与蒙特卡洛树搜索训练具身智能体,在多项基准测试中实现最优性能,并显著提升任务成功率与交互效率。
English: This paper presents MCTS-EP, an online learning framework integrating large language models with Monte Carlo Tree Search to train embodied agents, achieving state-of-the-art performance across multiple benchmarks through enhanced exploration and optimization.

Authors:Lingzhao Kong, Jiacheng Lin, Siyu Li, Kai Luo, Zhiyong Li, Kailun Yang
Title: CoBEVMoE: Heterogeneity-aware Feature Fusion with Dynamic Mixture-of-Experts for Collaborative Perception
Abstract:
Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird's Eye View (BEV) space and incorporates a Dynamic Mixture-of-Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter-expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR-V2X-C datasets demonstrate that CoBEVMoE achieves state-of-the-art performance. Specifically, it improves the IoU for Camera-based BEV segmentation by +1.5% on OPV2V and the AP@50 for LiDAR-based 3D object detection by +3.0% on DAIR-V2X-C, verifying the effectiveness of expert-based heterogeneous feature modeling in multi-agent collaborative perception. The source code will be made publicly available at https://github.com/godk0509/CoBEVMoE.
Chinese: 提出的CoBEVMoE框架通过动态专家混合架构,在协同感知中同时建模智能体间的特征相似性与异质性,在基准数据集上实现了最优性能。
English: The proposed CoBEVMoE framework enhances collaborative perception by dynamically modeling both feature similarities and heterogeneities across agents through a Dynamic Mixture-of-Experts architecture, achieving state-of-the-art performance on benchmark datasets.

Authors:Yuzhu Li, An Sui, Fuping Wu, Xiahai Zhuang
Title: Uncertainty-Supervised Interpretable and Robust Evidential Segmentation
Abstract:
Uncertainty estimation has been widely studied in medical image segmentation as a tool to provide reliability, particularly in deep learning approaches. However, previous methods generally lack effective supervision in uncertainty estimation, leading to low interpretability and robustness of the predictions. In this work, we propose a self-supervised approach to guide the learning of uncertainty. Specifically, we introduce three principles about the relationships between the uncertainty and the image gradients around boundaries and noise. Based on these principles, two uncertainty supervision losses are designed. These losses enhance the alignment between model predictions and human interpretation. Accordingly, we introduce novel quantitative metrics for evaluating the interpretability and robustness of uncertainty. Experimental results demonstrate that compared to state-of-the-art approaches, the proposed method can achieve competitive segmentation performance and superior results in out-of-distribution (OOD) scenarios while significantly improving the interpretability and robustness of uncertainty estimation. Code is available via https://github.com/suiannaius/SURE.
Chinese: 本研究提出了一种自监督的医学图像分割不确定性估计方法,通过新的监督损失函数提升了解释性和鲁棒性,在分布外场景中取得了优异结果和竞争力表现。
English: This study introduces a self-supervised method for uncertainty estimation in medical image segmentation, using novel supervision losses to enhance interpretability and robustness, achieving competitive performance and superior results in out-of-distribution scenarios.

Authors:Jie Chen, Yuhong Feng, Tao Dai, Mingzhe Liu, Hongtao Chen, Zhaoxi He, Jiancong Bai
Title: SFN-YOLO: Towards Free-Range Poultry Detection via Scale-aware Fusion Networks
Abstract:
Detecting and localizing poultry is essential for advancing smart poultry farming. Despite the progress of detection-centric methods, challenges persist in free-range settings due to multiscale targets, obstructions, and complex or dynamic backgrounds. To tackle these challenges, we introduce an innovative poultry detection approach named SFN-YOLO that utilizes scale-aware fusion. This approach combines detailed local features with broader global context to improve detection in intricate environments. Furthermore, we have developed a new expansive dataset (M-SCOPE) tailored for varied free-range conditions. Comprehensive experiments demonstrate our model achieves an mAP of 80.7% with just 7.2M parameters, which is 35.1% fewer than the benchmark, while retaining strong generalization capability across different domains. The efficient and real-time detection capabilities of SFN-YOLO support automated smart poultry farming. The code and dataset can be accessed at https://github.com/chenjessiee/SFN-YOLO.
中文摘要:SFN-YOLO通过尺度感知融合方法,在复杂散养环境中结合局部特征与全局上下文来提升家禽检测效果,以更少参数实现80.7%的mAP准确率,为智慧养殖提供自动化支持。
English Summary: SFN-YOLO introduces a scale-aware fusion approach that enhances poultry detection in complex free-range environments by integrating local features with global context, achieving 80.7% mAP with reduced parameters while supporting automated smart farming.

Authors:Binhua Huang, Ni Wang, Arjun Pakrashi, Soumyabrata Dev
Title: MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
Abstract:
Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. During fusion, both backbones are frozen, and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. Through comprehensive experiments on the UCF101 dataset, our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at https://github.com/microa/MoCLIP-Lite.
Chinese: MoCLIP-Lite是一种高效的双流视频识别框架,通过结合冻结的CLIP图像编码器和轻量级运动矢量网络,在UCF101数据集上实现了89.2%的准确率,且仅需极少的训练成本。
English: MoCLIP-Lite is an efficient two-stream video recognition framework that combines a frozen CLIP image encoder with a lightweight motion vector network, achieving 89.2% accuracy on UCF101 while requiring minimal training.

Authors:Yuhong Feng, Hongtao Chen, Qi Zhang, Jie Chen, Zhaoxi He, Mingzhe Liu, Jianghai Liao
Title: A Dual-Modulation Framework for RGB-T Crowd Counting via Spatially Modulated Attention and Adaptive Fusion
Abstract:
Accurate RGB-Thermal (RGB-T) crowd counting is crucial for public safety in challenging conditions. While recent Transformer-based methods excel at capturing global context, their inherent lack of spatial inductive bias causes attention to spread to irrelevant background regions, compromising crowd localization precision. Furthermore, effectively bridging the gap between these distinct modalities remains a major hurdle. To tackle this, we propose the Dual Modulation Framework, comprising two modules: Spatially Modulated Attention (SMA), which improves crowd localization by using a learnable Spatial Decay Mask to penalize attention between distant tokens and prevent focus from spreading to the background; and Adaptive Fusion Modulation (AFM), which implements a dynamic gating mechanism to prioritize the most reliable modality for adaptive cross-modal fusion. Extensive experiments on RGB-T crowd counting datasets demonstrate the superior performance of our method compared to previous works. Code available at https://github.com/Cht2924/RGBT-Crowd-Counting.
中文: 提出的双重调制框架通过空间调制注意力提升人群定位精度,并采用自适应融合调制实现动态跨模态整合,在RGB-T人群计数数据集上取得了领先的性能表现。
English: The proposed Dual Modulation Framework enhances RGB-Thermal crowd counting by introducing Spatially Modulated Attention to improve localization precision and Adaptive Fusion Modulation for dynamic cross-modal integration, achieving superior performance on benchmark datasets.

Authors:Kihyun Kim, Michalis Lazarou, Tania Stathaki
Title: Enhanced Detection of Tiny Objects in Aerial Images
Abstract:
While one-stage detectors like YOLOv8 offer fast training speed, they often under-perform on detecting small objects as a trade-off. This becomes even more critical when detecting tiny objects in aerial imagery due to low-resolution targets and cluttered backgrounds. To address this, we introduce three enhancement strategies -- input image resolution adjustment, data augmentation, and attention mechanisms -- that can be easily implemented on YOLOv8. We demonstrate that image size enlargement and the proper use of augmentation can lead to enhancement. Additionally, we designed a Mixture of Orthogonal Neural-modules Network (MoonNet) pipeline which consists of attention-augmented CNNs. Two well-known attention modules, the Squeeze-and-Excitation Block (SE Block) and the Convolutional Block Attention Module (CBAM), were integrated into the backbone of YOLOv8 with an increased number of channels, and the MoonNet backbone obtained improved detection accuracy compared to the original YOLOv8. MoonNet further proved its adaptability and potential by achieving state-of-the-art performance on a tiny-object benchmark when integrated with the YOLC model. Our codes are available at: https://github.com/Kihyun11/MoonNet
中文: 本文针对YOLOv8在航拍图像小目标检测中的不足,提出三种改进策略并设计MoonNet注意力增强网络,在微小目标检测基准上取得了最先进的性能表现。
English: This paper addresses YOLOv8's limitations in detecting small objects in aerial imagery by proposing three enhancement strategies and introducing MoonNet, an attention-augmented CNN backbone that achieves state-of-the-art performance on tiny-object detection benchmarks.

Authors:Kunrong Li, Kwan Hui Lim
Title: RALLM-POI: Retrieval-Augmented LLM for Zero-shot Next POI Recommendation with Geographical Reranking
Abstract:
Next point-of-interest (POI) recommendation predicts a user's next destination from historical movements. Traditional models require intensive training, while LLMs offer flexible and generalizable zero-shot solutions but often generate generic or geographically irrelevant results due to missing trajectory and spatial context. To address these issues, we propose RALLM-POI, a framework that couples LLMs with retrieval-augmented generation and self-rectification. We first propose a Historical Trajectory Retriever (HTR) that retrieves relevant past trajectories to serve as contextual references, which are then reranked by a Geographical Distance Reranker (GDR) for prioritizing spatially relevant trajectories. Lastly, an Agentic LLM Rectifier (ALR) is designed to refine outputs through self-reflection. Without additional training, RALLM-POI achieves substantial accuracy gains across three real-world Foursquare datasets, outperforming both conventional and LLM-based baselines. Code is released at https://github.com/LKRcrocodile/RALLM-POI.
中文摘要:RALLM-POI框架通过结合检索增强生成与自校正机制,利用历史轨迹和地理空间信息增强大语言模型的POI推荐能力,无需额外训练即在多个真实数据集上实现了显著优于传统方法的推荐精度。
English Summary: RALLM-POI is a novel framework that enhances next POI recommendation by integrating retrieval-augmented generation and self-rectification with LLMs, achieving superior accuracy without additional training by leveraging historical trajectories and geographical context.

Authors:Yao Du, Jiarong Guo, Xiaomeng Li
Title: CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner
Abstract:
Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across various clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture crucial temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under 1-shot setting. The code is available at https://github.com/xmed-lab/CardiacCLIP.
中文:提出的CardiacCLIP框架通过注意力机制筛选关键帧并结合多尺度特征提取,显著提升了超声心动图中左心室射血分数的预测精度,在EchoNet-Dynamic数据集上平均绝对误差降低2.07。
English: The proposed CardiacCLIP framework enhances LVEF prediction in echocardiography by integrating attention-based frame selection and multi-scale feature extraction, achieving significantly improved accuracy with a 2.07 MAE reduction on EchoNet-Dynamic.

Authors:Shuang Liang, Chaochuan Hou, Xu Yao, Shiping Wang, Minqi Jiang, Songqiao Han, Hailiang Huang
Title: TSGym: Design Choices for Deep Multivariate Time-Series Forecasting
Abstract:
Recently, deep learning has driven significant advancements in multivariate time series forecasting (MTSF) tasks. However, much of the current research in MTSF tends to evaluate models from a holistic perspective, which obscures the individual contributions and leaves critical issues unaddressed. Adhering to the current modeling paradigms, this work bridges these gaps by systematically decomposing deep MTSF methods into their core, fine-grained components like series-patching tokenization, channel-independent strategy, attention modules, or even Large Language Models and Time-series Foundation Models. Through extensive experiments and component-level analysis, our work offers more profound insights than previous benchmarks that typically discuss models as a whole. Furthermore, we propose a novel automated solution called TSGym for MTSF tasks. Unlike traditional hyperparameter tuning, neural architecture searching or fixed model selection, TSGym performs fine-grained component selection and automated model construction, which enables the creation of more effective solutions tailored to diverse time series data, therefore enhancing model transferability across different data sources and robustness against distribution shifts. Extensive experiments indicate that TSGym significantly outperforms existing state-of-the-art MTSF and AutoML methods. All code is publicly available on https://github.com/SUFE-AILAB/TSGym.
中文: 本研究通过系统分析多元时间序列预测模型的细粒度组件,提出了自动化组件选择框架TSGym,有效解决了现有研究的局限性,并在实验中展现出超越现有方法的优越性能。
English: This study addresses limitations in current multivariate time series forecasting research by systematically analyzing fine-grained model components and introducing TSGym, an automated component selection framework that demonstrates superior performance over existing methods.

Authors:Haizhou Ge, Yufei Jia, Zheng Li, Yue Li, Zhixing Chen, Ruqi Huang, Guyue Zhou
Title: FILIC: Dual-Loop Force-Guided Imitation Learning with Impedance Torque Control for Contact-Rich Manipulation Tasks
Abstract:
Contact-rich manipulation is crucial for robots to perform tasks requiring precise force control, such as insertion, assembly, and in-hand manipulation. However, most imitation learning (IL) policies remain position-centric and lack explicit force awareness, and adding force/torque sensors to collaborative robot arms is often costly and requires additional hardware design. To overcome these issues, we propose FILIC, a Force-guided Imitation Learning framework with impedance torque control. FILIC integrates a Transformer-based IL policy with an impedance controller in a dual-loop structure, enabling compliant force-informed, force-executed manipulation. For robots without force/torque sensors, we introduce a cost-effective end-effector force estimator using joint torque measurements through analytical Jacobian-based inversion while compensating with model-predicted torques from a digital twin. We also design complementary force feedback frameworks via handheld haptics and VR visualization to improve demonstration quality. Experiments show that FILIC significantly outperforms vision-only and joint-torque-based methods, achieving safer, more compliant, and adaptable contact-rich manipulation. Our code can be found in https://github.com/TATP-233/FILIC.
中文: FILIC是一种力引导的模仿学习框架,通过结合基于Transformer的策略与阻抗控制,即使没有力传感器也能利用关节扭矩估计和触觉反馈实现柔顺的接触式操作。
English: FILIC is a force-guided imitation learning framework that integrates a Transformer-based policy with impedance control, enabling compliant manipulation even without force sensors by using joint torque estimation and haptic feedback.

Authors:Hang Du, Jiayang Zhang, Guoshun Nan, Wendi Deng, Zhenyan Chen, Chenyang Zhang, Wang Xiao, Shan Huang, Yuqi Pan, Tao Qi, Sicong Leng
Title: From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
Abstract:
Multi-image Interleaved Reasoning aims to improve Multi-modal Large Language Models (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an "easy to hard" approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs capability to handle complex inter-modal tasks.Our code and dataset are available at https://github.com/Shelly-coder239/MIRBench.
中文: MIR基准通过要求模型结合交错文本联合分析多张图像,并采用渐进式课程学习策略,显著提升了多模态大语言模型处理复杂跨模态任务的推理能力。
English: The MIR benchmark advances multi-modal reasoning by requiring models to jointly analyze multiple images with interleaved texts, using a progressive curriculum strategy that significantly improves performance on complex cross-modal tasks.

Authors:Yajing Yang, Tony Deng, Min-Yen Kan
Title: KAHAN: Knowledge-Augmented Hierarchical Analysis and Narration for Financial Data Narration
Abstract:
We propose KAHAN, a knowledge-augmented hierarchical framework that systematically extracts insights from raw tabular data at entity, pairwise, group, and system levels. KAHAN uniquely leverages LLMs as domain experts to drive the analysis. On DataTales financial reporting benchmark, KAHAN outperforms existing approaches by over 20% on narrative quality (GPT-4o), maintains 98.2% factuality, and demonstrates practical utility in human evaluation. Our results reveal that knowledge quality drives model performance through distillation, hierarchical analysis benefits vary with market complexity, and the framework transfers effectively to healthcare domains. The data and code are available at https://github.com/yajingyang/kahan.
中文: KAHAN是一个知识增强的分层框架,利用大语言模型作为领域专家从表格数据中提取洞察,在基准测试中展现出卓越的叙事质量、高事实准确性及优秀的跨领域迁移能力。
English: KAHAN is a knowledge-augmented hierarchical framework that uses LLMs as domain experts to extract insights from tabular data, achieving superior narrative quality, high factuality, and effective cross-domain transfer on benchmarks.

Authors:Wenxuan Fang, Jili Fan, Chao Wang, Xiantao Hu, Jiangwei Weng, Ying Tai, Jian Yang, Jun Li
Title: When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image Restoration
Abstract:
Adverse Weather Image Restoration (AWIR) is a highly challenging task due to the unpredictable and dynamic nature of weather-related degradations. Traditional task-specific methods often fail to generalize to unseen or complex degradation types, while recent prompt-learning approaches depend heavily on the degradation estimation capabilities of vision-language models, resulting in inconsistent restorations. In this paper, we propose \textbf{LCDiff}, a novel framework comprising two key components: \textit{Lumina-Chroma Decomposition Network} (LCDN) and \textit{Lumina-Guided Diffusion Model} (LGDM). LCDN processes degraded images in the YCbCr color space, separately handling degradation-related luminance and degradation-invariant chrominance components. This decomposition effectively mitigates weather-induced degradation while preserving color fidelity. To further enhance restoration quality, LGDM leverages degradation-related luminance information as a guiding condition, eliminating the need for explicit degradation prompts. Additionally, LGDM incorporates a \textit{Dynamic Time Step Loss} to optimize the denoising network, ensuring a balanced recovery of both low- and high-frequency features in the image. Finally, we present DriveWeather, a comprehensive all-weather driving dataset designed to enable robust evaluation. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods, setting a new benchmark in AWIR. The dataset and code are available at: https://github.com/fiwy0527/LCDiff.
中文: 提出的LCDiff框架通过在YCbCr色彩空间分解亮度和色度分量,并采用亮度引导的扩散模型与动态时间步长优化,有效恢复恶劣天气下的图像退化,在新型DriveWeather数据集上的实验表明其性能超越现有最佳方法。
English: The proposed LCDiff framework effectively restores weather-degraded images by decomposing luminance and chrominance components in YCbCr space and using luminance-guided diffusion with dynamic time step optimization, outperforming existing methods as validated on the new DriveWeather dataset.

Authors:Feng Han, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang
Title: VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation
Abstract:
Recently, autoregressive image generation models have wowed audiences with their remarkable capability in creating surprisingly realistic images. Models such as GPT-4o and LlamaGen can not only produce images that faithfully mimic renowned artistic styles like Ghibli, Van Gogh, or Picasso, but also potentially generate Not-Safe-For-Work (NSFW) content, raising significant concerns regarding copyright infringement and ethical use. Despite these concerns, methods to safeguard autoregressive text-to-image models remain underexplored. Previous concept erasure methods, primarily designed for diffusion models that operate in denoising latent space, are not directly applicable to autoregressive models that generate images token by token. To address this critical gap, we propose Visual Contrast Exploitation (VCE), a novel framework comprising: (1) an innovative contrastive image pair construction paradigm that precisely decouples unsafe concepts from their associated content semantics, and (2) a sophisticated DPO-based training approach that enhances the model's ability to identify and leverage visual contrastive features from image pairs, enabling precise concept erasure. Our comprehensive experiments across three challenging tasks-artist style erasure, explicit content erasure, and object removal-demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts. The code and models are available at https://github.com/Maplebb/VCE.
中文: 近期自回归图像生成模型虽能创作逼真图像,却引发版权与不良内容担忧;为此提出视觉对比利用(VCE)框架,通过对比图像对和基于DPO的训练,在消除危险概念的同时完美保留安全内容,实现精准防护。
English: Recent autoregressive image generation models like GPT-4o and LlamaGen produce highly realistic images but raise concerns about copyright and NSFW content, prompting the development of Visual Contrast Exploitation (VCE), a novel framework that effectively erases unsafe concepts while preserving safe ones through contrastive image pairs and DPO-based training.

Authors:Yuhang Jia, Xu Zhang, Yang Chen, Hui Wang, Enzhi Wang, Yong Qin
Title: Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs
Abstract:
Automatic mean opinion score (MOS) prediction provides a more perceptual alternative to objective metrics, offering deeper insights into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their enhanced perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. In this work, we tackle the challenging task of audio editing evaluation and propose the first natural language-based automated evaluation framework built on MLLMs. Our approach introduces two fine-tuning tasks to boost multi-audio understanding, combined with Chain-of-Thought prompting, and lightweight instruction tuning, to enhance step-by-step reasoning. Experiment demonstrate that our framework delivers accurate, interpretable, and text-based editing evaluation, closely aligning with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.
中文: 本研究首次提出了基于自然语言的多模态大语言模型自动评估框架,用于音频编辑评价,实现了与人类判断和客观指标高度一致的准确且可解释的结果。
English: This study introduces the first natural language-based automated evaluation framework using multimodal large language models for audio editing assessment, achieving accurate and interpretable results that align closely with human judgment and objective metrics.

Authors:Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji
Title: The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
Abstract:
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA SaSaSa2VA to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $J\&F$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in Sa2VA repository: https://github.com/magic-research/Sa2VA.
中文:提出的SaSaSa2VA模型通过解决稀疏帧采样和单一标记限制来增强视频对象分割,借助分割增强和测试时集成方法,在RVOS挑战赛中取得了最佳性能。
English: The proposed SaSaSa2VA model enhances video object segmentation by addressing sparse frame sampling and single-token limitations, achieving top performance in the RVOS challenge through segmentation augmentation and test-time ensembling.

Authors:Leiyu Wang, Biao Jin, Feng Huang, Liqiong Chen, Zhengyong Wang, Xiaohai He, Honggang Chen
Title: MO R-CNN: Multispectral Oriented R-CNN for Object Detection in Remote Sensing Image
Abstract:
Oriented object detection for multi-spectral imagery faces significant challenges due to differences both within and between modalities. Although existing methods have improved detection accuracy through complex network architectures, their high computational complexity and memory consumption severely restrict their performance. Motivated by the success of large kernel convolutions in remote sensing, we propose MO R-CNN, a lightweight framework for multi-spectral oriented detection featuring heterogeneous feature extraction network (HFEN), single modality supervision (SMS), and condition-based multimodal label fusion (CMLF). HFEN leverages inter-modal differences to adaptively align, merge, and enhance multi-modal features. SMS constrains multi-scale features and enables the model to learn from multiple modalities. CMLF fuses multimodal labels based on specific rules, providing the model with a more robust and consistent supervisory signal. Experiments on the DroneVehicle, VEDAI and OGSOD datasets prove the superiority of our method. The source code is available at:https://github.com/Iwill-github/MORCNN.
中文摘要:提出的MO R-CNN框架通过异构特征提取、单模态监督和条件式标签融合等轻量化组件,有效解决多光谱定向目标检测难题,在多个基准数据集上验证了优越性能。
English Summary: The proposed MO R-CNN framework addresses multi-spectral oriented object detection challenges through lightweight components including heterogeneous feature extraction, single modality supervision, and conditional label fusion, demonstrating superior performance on benchmark datasets.

Authors:Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
Title: Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Abstract:
Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations.To validate our approach, we integrate the framework into the LLaVA-1.5 architecture. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
中文: 本文提出一种自蒸馏区域提议网络(SD-RPN),通过将多模态大语言模型的中间层注意力图转化为高质量区域提议,无需大量标注数据即可显著提升细粒度感知能力,在多项基准测试中实现超过10%的准确率提升。
English: This paper introduces a Self-Distilled Region Proposal Network (SD-RPN) that efficiently enhances multimodal large language models' fine-grained perception by generating high-quality region proposals from denoised attention maps, achieving significant accuracy improvements without costly annotations or full model fine-tuning.

Authors:Dat Thanh Tran, Khai Quang Tran, Khoi Anh Pham, Van Khu Vu, Dong Duc Do
Title: NeuFACO: Neural Focused Ant Colony Optimization for Traveling Salesman Problem
Abstract:
This study presents Neural Focused Ant Colony Optimization (NeuFACO), a non-autoregressive framework for the Traveling Salesman Problem (TSP) that combines advanced reinforcement learning with enhanced Ant Colony Optimization (ACO). NeuFACO employs Proximal Policy Optimization (PPO) with entropy regularization to train a graph neural network for instance-specific heuristic guidance, which is integrated into an optimized ACO framework featuring candidate lists, restricted tour refinement, and scalable local search. By leveraging amortized inference alongside ACO stochastic exploration, NeuFACO efficiently produces high-quality solutions across diverse TSP instances.
Chinese: 本研究提出NeuFACO框架,通过结合强化学习与改进蚁群优化算法,无需自回归即可为旅行商问题高效生成高质量解决方案。
English: This study introduces NeuFACO, a non-autoregressive framework that integrates reinforcement learning with enhanced Ant Colony Optimization to efficiently generate high-quality solutions for the Traveling Salesman Problem.

Authors:Ragib Amin Nihal, Benjamin Yen, Takeshi Ashizawa, Kazuhiro Nakadai
Title: Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment
Abstract:
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. However, existing methods often struggle to address nonlinear clock drift and lack mechanisms for quantifying uncertainty. Traditional methods like Cross-correlation and Dynamic Time Warping assume simple drift patterns and provide no reliability measures. Meanwhile, recent deep learning models typically treat alignment as a binary classification task, overlooking inter-channel dependencies and uncertainty estimation. We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization. We extend BEATs encoders with cross-attention layers to model temporal relationships between channels. We also develop a confidence-weighted scoring function that uses the full prediction distribution instead of binary thresholding. Our method achieved first place in the BioDCASE 2025 Task 1 challenge with 0.30 MSE average across test datasets, compared to 0.58 for the deep learning baseline. On individual datasets, we achieved 0.14 MSE on ARU data (77% reduction) and 0.45 MSE on zebra finch data (18% reduction). The framework supports probabilistic temporal alignment, moving beyond point estimates. While validated in a bioacoustic context, the approach is applicable to a broader range of multi-channel audio tasks where alignment confidence is critical. Code available on: https://github.com/Ragib-Amin-Nihal/BEATsCA
中文: 本研究提出了一种结合交叉注意力机制与置信度加权评分的新型多通道音频对齐方法,在BioDCASE 2025挑战赛中显著降低了对齐误差,同时实现了不确定性量化,展现出优越性能。
English: This study introduces a novel multi-channel audio alignment method combining cross-attention mechanisms with confidence-weighted scoring, achieving superior performance in the BioDCASE 2025 challenge by significantly reducing alignment errors while providing uncertainty quantification.

Authors:Zhijie Qiao, Haowei Li, Zhong Cao, Henry X. Liu
Title: End2Race: Efficient End-to-End Imitation Learning for Real-Time F1Tenth Racing
Abstract:
F1Tenth is a widely adopted reduced-scale platform for developing and testing autonomous racing algorithms, hosting annual competitions worldwide. With high operating speeds, dynamic environments, and head-to-head interactions, autonomous racing requires algorithms that diverge from those in classical autonomous driving. Training such algorithms is particularly challenging: the need for rapid decision-making at high speeds severely limits model capacity. To address this, we propose End2Race, a novel end-to-end imitation learning algorithm designed for head-to-head autonomous racing. End2Race leverages a Gated Recurrent Unit (GRU) architecture to capture continuous temporal dependencies, enabling both short-term responsiveness and long-term strategic planning. We also adopt a sigmoid-based normalization function that transforms raw LiDAR scans into spatial pressure tokens, facilitating effective model training and convergence. The algorithm is extremely efficient, achieving an inference time of less than 0.5 milliseconds on a consumer-class GPU. Experiments in the F1Tenth simulator demonstrate that End2Race achieves a 94.2% safety rate across 2,400 overtaking scenarios, each with an 8-second time limit, and successfully completes overtakes in 59.2% of cases. This surpasses previous methods and establishes ours as a leading solution for the F1Tenth racing testbed. Code is available at https://github.com/michigan-traffic-lab/End2Race.
Chinese: End2Race是一种高效的端到端模仿学习算法,采用门控循环单元和激光雷达标记化技术,在F1Tenth自动驾驶竞速模拟中实现了94.2%的安全率,性能超越现有方法。
English: End2Race is an efficient end-to-end imitation learning algorithm using GRU and LiDAR tokenization, achieving a 94.2% safety rate and outperforming prior methods in F1Tenth autonomous racing simulations.

Authors:Faramarz Farhangian, Leandro A. Ensina, George D. C. Cavalcanti, Rafael M. O. Cruz
Title: DRES: Fake news detection by dynamic representation and ensemble selection
Abstract:
The rapid spread of information via social media has made text-based fake news detection critically important due to its societal impact. This paper presents a novel detection method called Dynamic Representation and Ensemble Selection (DRES) for identifying fake news based solely on text. DRES leverages instance hardness measures to estimate the classification difficulty for each news article across multiple textual feature representations. By dynamically selecting the textual representation and the most competent ensemble of classifiers for each instance, DRES significantly enhances prediction accuracy. Extensive experiments show that DRES achieves notable improvements over state-of-the-art methods, confirming the effectiveness of representation selection based on instance hardness and dynamic ensemble selection in boosting performance. Codes and data are available at: https://github.com/FFarhangian/FakeNewsDetection_DRES
中文: 本文提出了一种名为DRES的新型虚假新闻检测方法,该方法基于实例难度动态选择文本表示和分类器集成,相比现有方法显著提升了检测准确率。
English: This paper introduces DRES, a novel fake news detection method that dynamically selects textual representations and classifier ensembles based on instance hardness, achieving superior accuracy over existing approaches.

Authors:Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu, Georges El Fakhri, Xiaofeng Liu, Shijian Lu
Title: Rethinking Evaluation of Infrared Small Target Detection
Abstract:
As an essential vision task, infrared small target detection (IRSTD) has seen significant advancements through deep learning. However, critical limitations in current evaluation protocols impede further progress. First, existing methods rely on fragmented pixel- and target-level specific metrics, which fails to provide a comprehensive view of model capabilities. Second, an excessive emphasis on overall performance scores obscures crucial error analysis, which is vital for identifying failure modes and improving real-world system performance. Third, the field predominantly adopts dataset-specific training-testing paradigms, hindering the understanding of model robustness and generalization across diverse infrared scenarios. This paper addresses these issues by introducing a hybrid-level metric incorporating pixel- and target-level performance, proposing a systematic error analysis method, and emphasizing the importance of cross-dataset evaluation. These aim to offer a more thorough and rational hierarchical analysis framework, ultimately fostering the development of more effective and robust IRSTD models. An open-source toolkit has be released to facilitate standardized benchmarking.
中文摘要:本文针对红外小目标检测评估方法的三大局限,提出了融合像素与目标级指标的混合度量体系、系统化误差分析方法及跨数据集评估方案,旨在建立更全面的分层分析框架以促进模型发展。
English Summary: This paper identifies key limitations in current infrared small target detection evaluation methods and proposes a comprehensive framework integrating hybrid-level metrics, systematic error analysis, and cross-dataset evaluation to advance model development.

Authors:Rui Yang, Michael Fu, Chakkrit Tantithamthavorn, Chetan Arora, Gunel Gulmammadova, Joey Chua
Title: AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software
Abstract:
Guardrails are critical for the safe deployment of Large Language Models (LLMs)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions--opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply--to as low as 12%--when confronted with unseen attacks. This highlights a growing software engineering challenge: how to build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Through empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies post deployment. We release our AdaptiveGuard and studied datasets at https://github.com/awsm-research/AdaptiveGuard to support further research.
中文: 护栏对保护大语言模型免受越狱攻击至关重要,但现有系统难以应对新型威胁,因此我们提出AdaptiveGuard,一种自适应护栏,能检测新攻击并持续学习防御,实现高精度和快速适应。
English: Guardrails are essential for protecting Large Language Models from jailbreak attacks, but current systems struggle with new threats, prompting the development of AdaptiveGuard, an adaptive solution that detects novel attacks and learns to counter them, achieving high accuracy and rapid adaptation.

Authors:Devin R. Wright, Jisun An, Yong-Yeol Ahn
Title: Cognitive Linguistic Identity Fusion Score (CLIFS): A Scalable Cognition-Informed Approach to Quantifying Identity Fusion from Text
Abstract:
Quantifying identity fusion -- the psychological merging of self with another entity or abstract target (e.g., a religious group, political party, ideology, value, brand, belief, etc.) -- is vital for understanding a wide range of group-based human behaviors. We introduce the Cognitive Linguistic Identity Fusion Score (CLIFS), a novel metric that integrates cognitive linguistics with large language models (LLMs), which builds on implicit metaphor detection. Unlike traditional pictorial and verbal scales, which require controlled surveys or direct field contact, CLIFS delivers fully automated, scalable assessments while maintaining strong alignment with the established verbal measure. In benchmarks, CLIFS outperforms both existing automated approaches and human annotation. As a proof of concept, we apply CLIFS to violence risk assessment to demonstrate that it can improve violence risk assessment by more than 240%. Building on our identification of a new NLP task and early success, we underscore the need to develop larger, more diverse datasets that encompass additional fusion-target domains and cultural backgrounds to enhance generalizability and further advance this emerging area. CLIFS models and code are public at https://github.com/DevinW-sudo/CLIFS.
Chinese: 认知语言身份融合评分(CLIFS)是一种结合认知语言学与大语言模型的新型自动化度量方法,能有效量化身份融合,其性能超越现有方法,并在暴力风险评估中实现了超过240%的提升。
English: The Cognitive Linguistic Identity Fusion Score (CLIFS) is a novel automated metric that integrates cognitive linguistics with large language models to quantify identity fusion, outperforming existing methods and demonstrating a 240% improvement in violence risk assessment.

Authors:Md. Atabuzzaman, Ali Asgarov, Chris Thomas
Title: Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have achieved strong performance on vision-language tasks, particularly Visual Question Answering (VQA). While prior work has explored unimodal biases in VQA, the problem of selection bias in Multiple-Choice Question Answering (MCQA), where models may favor specific option tokens (e.g., "A") or positions, remains underexplored. In this paper, we investigate both the presence and nature of selection bias in LVLMs through fine-grained MCQA benchmarks spanning easy, medium, and hard difficulty levels, defined by the semantic similarity of the options. We further propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts and applies confidence-adaptive corrections to the model's output. Our method mitigates bias without retraining and is compatible with frozen LVLMs. Extensive experiments across several state-of-the-art models reveal consistent selection biases that intensify with task difficulty, and show that our mitigation approach significantly reduces bias while improving accuracy in challenging settings. This work offers new insights into the limitations of LVLMs in MCQA and presents a practical approach to improve their robustness in fine-grained visual reasoning. Datasets and code are available at: https://github.com/Atabuzzaman/Selection-Bias-of-LVLMs
大型视觉语言模型在多选题问答中存在随任务难度加剧的选择偏差,我们提出的对数级去偏方法无需重新训练即可有效缓解偏差并提升准确率。
Large Vision-Language Models exhibit consistent selection bias in Multiple-Choice Question Answering that escalates with task difficulty, and our proposed logit-level debiasing method effectively mitigates this bias without retraining while enhancing accuracy.

Authors:Kai Jiang, Zhengyan Shi, Dell Zhang, Hongyuan Zhang, Xuelong Li
Title: Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning
Abstract:
Class Incremental Learning (CIL) aims to continuously learn new categories while retaining the knowledge of old ones. Pre-trained models (PTMs) show promising capabilities in CIL. However, existing approaches that apply lightweight fine-tuning to backbones still induce parameter drift, thereby compromising the generalization capability of pre-trained models. Parameter drift can be conceptualized as a form of noise that obscures critical patterns learned for previous tasks. However, recent researches have shown that noise is not always harmful. For example, the large number of visual patterns learned from pre-training can be easily abused by a single task, and introducing appropriate noise can suppress some low-correlation features, thus leaving a margin for future tasks. To this end, we propose learning beneficial noise for CIL guided by information theory and propose Mixture of Noise (Min), aiming to mitigate the degradation of backbone generalization from adapting new tasks. Specifically, task-specific noise is learned from high-dimension features of new tasks. Then, a set of weights is adjusted dynamically for optimal mixture of different task noise. Finally, Min embeds the beneficial noise into the intermediate features to mask the response of inefficient patterns. Extensive experiments on six benchmark datasets demonstrate that Min achieves state-of-the-art performance in most incremental settings, with particularly outstanding results in 50-steps incremental settings. This shows the significant potential for beneficial noise in continual learning. Code is available at https://github.com/ASCIIJK/MiN-NeurIPS2025.
Chinese: 提出的噪声混合方法依据信息理论从新任务中学习有益噪声,通过动态混合并嵌入特征来缓解参数漂移,从而在类增量学习中保持预训练模型的泛化能力,在多个基准测试中取得了领先的性能。
English: The proposed Mixture of Noise (Min) method leverages information theory to learn beneficial noise from new tasks, which is dynamically mixed and embedded into features to mitigate parameter drift and preserve the generalization of pre-trained models in class incremental learning, achieving state-of-the-art performance across multiple benchmarks.

Authors:Dongdong Chen, Linlin Yao, Mengjun Liu, Zhenrong Shen, Yuqi Hu, Zhiyun Song, Shengyu Lu, Qian Wang, Dinggang Shen, Lichi Zhang
Title: Brain Connectivity Network Structure Learning For Brain Disorder Diagnosis
Abstract:
Recent studies in neuroscience highlight the significant potential of brain connectivity networks, which are commonly constructed from functional magnetic resonance imaging (fMRI) data for brain disorder diagnosis. Traditional brain connectivity networks are typically obtained using predefined methods that incorporate manually-set thresholds to estimate inter-regional relationships. However, such approaches often introduce redundant connections or overlook essential interactions, compromising the value of the constructed networks. Besides, the insufficiency of labeled data further increases the difficulty of learning generalized representations of intrinsic brain characteristics. To mitigate those issues, we propose a self-supervised framework to learn an optimal structure and representation for brain connectivity networks, focusing on individualized generation and optimization in an unsupervised manner. We firstly employ two existing whole-brain connectomes to adaptively construct their complementary brain network structure learner, and then introduce a multi-state graph-based encoder with a joint iterative learning strategy to simultaneously optimize both the generated network structure and its representation. By leveraging self-supervised pretraining on large-scale unlabeled brain connectivity data, our framework enables the brain connectivity network learner to generalize e ffectively to unseen disorders, while requiring only minimal finetuning of the encoder for adaptation to new diagnostic tasks. Extensive experiments on cross-dataset brain disorder diagnosis demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness and generalizability. The code is publicly available at https://github.com/neochen1/BCNSL.
Chinese: 本文提出了一种自监督框架,能够自适应地从功能磁共振成像数据中构建和优化脑连接网络,有效泛化至未知脑部疾病,仅需少量微调即可在跨数据集诊断中超越现有方法。
English: This paper introduces a self-supervised framework that adaptively constructs and optimizes brain connectivity networks from fMRI data, enabling effective generalization to unseen brain disorders with minimal fine-tuning and outperforming existing methods in cross-dataset diagnosis.

Authors:Auss Abbood, Zaiqiao Meng, Nigel Collier
Title: Time to Revist Exact Match
Abstract:
Temporal question answering is an established method for evaluating temporal reasoning in large language models. Expected answers are often numeric (e.g., dates or durations), yet model responses are evaluated like regular text with exact match (EM), unable to distinguish small from large errors. In this investigative work, we frame temporal question answering as a numerical estimation task to assess the shortcomings of EM. We introduce TempAnswerQA, a benchmark distilled from Test of Time and TempTabQA, where all questions require a numerical, temporal answer, allowing us to evaluate models beyond EM. We use the forecasting metrics symmetric mean absolute percentage error (sMAPE) and mean absolute scaled error (MASE). With sMAPE, we find that error size and EM are decoupled. Models with low EM still have low sMAPE (both ~20%), and some models have high sMAPE despite high EM. Scaling errors by the deviation of the ground truth data with MASE reshuffles model rankings compared to EM, revealing gaps in models' understanding of temporal domain knowledge, especially when trained with synthetic data. Lastly, the models' most frequent error is to deviate by only $\pm1$ from the ground truth. sMAPE and MASE, unlike EM, adequately weight these errors. Our findings underscore the need for specialised metrics for temporal QA tasks. Code and data are available on https://github.com/aauss/temporal-answer-qa.
中文: 本研究将时序问答重构为数值估计任务,通过TempAnswerQA基准和sMAPE、MASE等预测指标,揭示了精确匹配评估的局限性,并证明需要专门指标来评估时序推理能力。
English: This study reframes temporal question answering as a numerical estimation task, introducing the TempAnswerQA benchmark and forecasting metrics like sMAPE and MASE to reveal limitations of exact match evaluation and demonstrate the need for specialized metrics in assessing temporal reasoning.

Authors:Pan Liu, Jinshi Liu
Title: When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
Abstract:
While significant advances exist in pseudo-label generation for semi-supervised semantic segmentation, pseudo-label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high-confidence predictions as pseudo-labels. However, these methods cannot cope with network overconfidence tendency, where correct and incorrect predictions overlap significantly in high-confidence regions, making separation challenging and amplifying model cognitive bias. Meanwhile, the direct discarding of low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo-label selection as a convex optimization problem within the confidence distribution feature space, establishing sample-specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low-reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal, Cityscapes, and COCO benchmarks show that CSL performs favorably against state-of-the-art methods. Code and model weights are available at https://github.com/PanLiuCSU/CSL.
Chinese: 本文提出置信度可分离学习(CSL)方法,通过将伪标签选择构建为置信度分布特征空间中的凸优化问题来建立样本特定决策边界,并采用随机掩码从低可靠性区域学习上下文关系,在多个基准测试中展现出优于现有方法的性能。
English: This paper introduces Confidence Separable Learning (CSL), a novel method that formulates pseudo-label selection as a convex optimization problem to establish sample-specific decision boundaries and employs random masking to learn from low-reliability regions, demonstrating superior performance on major benchmarks compared to existing approaches.

Authors:Suorong Yang, Hongchao Yang, Suhan Guo, Furao Shen, Jian Zhao
Title: IPF-RDA: An Information-Preserving Framework for Robust Data Augmentation
Abstract:
Data augmentation is widely utilized as an effective technique to enhance the generalization performance of deep models. However, data augmentation may inevitably introduce distribution shifts and noises, which significantly constrain the potential and deteriorate the performance of deep networks. To this end, we propose a novel information-preserving framework, namely IPF-RDA, to enhance the robustness of data augmentations in this paper. IPF-RDA combines the proposal of (i) a new class-discriminative information estimation algorithm that identifies the points most vulnerable to data augmentation operations and corresponding importance scores; And (ii) a new information-preserving scheme that preserves the critical information in the augmented samples and ensures the diversity of augmented data adaptively. We divide data augmentation methods into three categories according to the operation types and integrate these approaches into our framework accordingly. After being integrated into our framework, the robustness of data augmentation methods can be enhanced and their full potential can be unleashed. Extensive experiments demonstrate that although being simple, IPF-RDA consistently improves the performance of numerous commonly used state-of-the-art data augmentation methods with popular deep models on a variety of datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet, CUHK03, Market1501, Oxford Flower, and MNIST, where its performance and scalability are stressed. The implementation is available at https://github.com/Jackbrocp/IPF-RDA.
中文摘要:本文提出IPF-RDA框架,通过识别易受数据增强影响的关键点并保留重要信息,有效提升多种数据增强方法的鲁棒性,在多个数据集上显著改善深度模型的性能表现。
English Summary: The paper introduces IPF-RDA, a novel framework that enhances data augmentation robustness by identifying vulnerable data points and preserving critical information, consistently improving performance across various datasets and models.

Authors:Wenxin Li, Kunyu Peng, Di Wen, Ruiping Liu, Mengfei Duan, Kai Luo, Kailun Yang
Title: Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
Abstract:
Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noises for the action-based video object segmentation task. Second, we build up the first action-based video object segmentation under a label noise benchmark ActiSeg-NL and adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at https://github.com/mylwx/ActiSeg-NL.
中文摘要:本研究首次针对标签噪声下的动作视频对象分割建立基准,通过调整学习策略和引入并行掩码头机制,有效应对文本和掩码标注噪声,提升了模型的鲁棒性。
English Summary: This study introduces the first benchmark for action-based video object segmentation under label noise, addressing both textual and mask annotation noise through adapted learning strategies and a novel Parallel Mask Head Mechanism to enhance robustness.

Authors:Simone Ricci, Niccolò Biondi, Federico Pernici, Ioannis Patras, Alberto Del Bimbo
Title: $\boldsymbolλ$-Orthogonality Regularization for Compatible Representation Learning
Abstract:
Retrieval systems rely on representations learned by increasingly powerful models. However, due to the high training cost and inconsistencies in learned representations, there is significant interest in facilitating communication between representations and ensuring compatibility across independently trained neural networks. In the literature, two primary approaches are commonly used to adapt different learned representations: affine transformations, which adapt well to specific distributions but can significantly alter the original representation, and orthogonal transformations, which preserve the original structure with strict geometric constraints but limit adaptability. A key challenge is adapting the latent spaces of updated models to align with those of previous models on downstream distributions while preserving the newly learned representation spaces. In this paper, we impose a relaxed orthogonality constraint, namely $λ$-orthogonality regularization, while learning an affine transformation, to obtain distribution-specific adaptation while retaining the original learned representations. Extensive experiments across various architectures and datasets validate our approach, demonstrating that it preserves the model's zero-shot performance and ensures compatibility across model updates. Code available at: https://github.com/miccunifi/lambda_orthogonality
中文: 本文提出了一种λ正交正则化方法,通过结合仿射变换与宽松正交约束,在保持原始表征结构的同时实现模型更新间的表示对齐与分布适应。
English: This paper introduces a λ-orthogonality regularization method that combines affine transformations with relaxed orthogonality constraints to align model representations across updates while preserving both adaptability and original learned features.

Authors:Changyu Zeng, Yifan Wang, Zimu Wang, Wei Wang, Zhengni Yang, Muyi Bao, Jiming Xiao, Ahn Nguyen, Yutao Yue
Title: NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities
Abstract:
Recent advancements in 2D multimodal large language models (MLLMs) have significantly improved performance in vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge due to the complexity of spatial reasoning. Nevertheless, existing 3D benchmarks often lack fine-grained numerical reasoning task annotations, limiting MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and various question-answer pairs, generated using NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework, demonstrating that current LLMs struggle with multimodal numerical reasoning, particularly in performing precise computations such as distance and volume estimation, highlighting the need for further advancements in 3D models. The dataset and source codes can be obtained from https://github.com/fengshun124/NUMINA.
中文: 当前二维多模态模型在视觉语言任务中表现出色,但在三维空间推理中因缺乏细粒度数值标注而面临挑战,为此推出NUMINA基准,通过自动化标注和评估来增强多模态数值理解能力。
English: Recent 2D multimodal models excel in vision-language tasks but face challenges in 3D spatial reasoning due to limited fine-grained numerical annotations, prompting the introduction of NUMINA benchmark to enhance multimodal numerical understanding through automated annotations and evaluations.

Authors:Changyu Zeng, Yifan Wang, Zimu Wang, Wei Wang, Zhengni Yang, Muyi Bao, Jiming Xiao, Anh Nguyen, Yutao Yue
Title: NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities
Abstract:
Recent advancements in 2D multimodal large language models (MLLMs) have significantly improved performance in vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge due to the complexity of spatial reasoning. Nevertheless, existing 3D benchmarks often lack fine-grained numerical reasoning task annotations, limiting MLLMs' ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and various question-answer pairs, generated using NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework, demonstrating that current LLMs struggle with multimodal numerical reasoning, particularly in performing precise computations such as distance and volume estimation, highlighting the need for further advancements in 3D models. The dataset and source codes can be obtained from https://github.com/fengshun124/NUMINA.
中文: 当前二维多模态模型在视觉语言任务中表现出色,但在三维空间推理中因缺乏细粒度数值标注而面临挑战,为此推出NUMINA基准,通过自动化标注和评估来增强多模态数值理解能力。
English: Recent 2D multimodal models excel in vision-language tasks but face challenges in 3D spatial reasoning due to limited fine-grained numerical annotations, prompting the introduction of NUMINA benchmark to enhance multimodal numerical understanding through automated annotations and evaluations.

Authors:Weiran Chen, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu
Title: DA-Font: Few-Shot Font Generation via Dual-Attention Hybrid Integration
Abstract:
Few-shot font generation aims to create new fonts with a limited number of glyph references. It can be used to significantly reduce the labor cost of manual font design. However, due to the variety and complexity of font styles, the results generated by existing methods often suffer from visible defects, such as stroke errors, artifacts and blurriness. To address these issues, we propose DA-Font, a novel framework which integrates a Dual-Attention Hybrid Module (DAHM). Specifically, we introduce two synergistic attention blocks: the component attention block that leverages component information from content images to guide the style transfer process, and the relation attention block that further refines spatial relationships through interacting the content feature with both original and stylized component-wise representations. These two blocks collaborate to preserve accurate character shapes and stylistic textures. Moreover, we also design a corner consistency loss and an elastic mesh feature loss to better improve geometric alignment. Extensive experiments show that our DA-Font outperforms the state-of-the-art methods across diverse font styles and characters, demonstrating its effectiveness in enhancing structural integrity and local fidelity. The source code can be found at \href{https://github.com/wrchen2001/DA-Font}{\textit{https://github.com/wrchen2001/DA-Font}}.
Chinese: DA-Font通过双注意力机制和优化的损失函数,有效解决了少样本字体生成中的笔画错误和结构失真问题,显著提升了字体生成的视觉质量。
English: DA-Font introduces a dual-attention framework and specialized loss functions to overcome defects in few-shot font generation, achieving superior results in style and structural accuracy.

Authors:Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun
Title: Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features
Abstract:
Positional encoding is essential for supplementing transformer with positional information of tokens. Existing positional encoding methods demand predefined token/feature order, rendering them unsuitable for real-world data with non-sequential yet causally-related features. To address this limitation, we propose CAPE, a novel method that identifies underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG) using generalized structural equation modeling. The DAG is then embedded in hyperbolic space where its geometric structure is well-preserved using a hyperboloid model-based approach that effectively captures two important causal graph properties (causal strength & causal specificity). This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integrating with transformer's self-attention mechanism. Theoretical analysis reveals that CAPE-generated rotary positional encodings possess three valuable properties for enhanced self-attention, including causal distance-induced attenuation, causal generality-induced attenuation, and robustness to positional disturbances. We evaluate CAPE over both synthetic and real-word datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformer for data with non-sequential features. Our code is available at https://github.com/Catchxu/CAPE.
中文: CAPE提出了一种新颖的位置编码方法,将非序列特征建模为双曲空间中的因果图,通过因果感知特性增强Transformer的自注意力机制。
English: CAPE introduces a novel positional encoding method that models non-sequential features as a causal graph embedded in hyperbolic space, enhancing transformers with causality-aware properties for improved self-attention.

Authors:Junjie Zhou, Haijun Xiong, Junhao Lu, Ziyu Lin, Bin Feng
Title: CGTGait: Collaborative Graph and Transformer for Gait Emotion Recognition
Abstract:
Skeleton-based gait emotion recognition has received significant attention due to its wide-ranging applications. However, existing methods primarily focus on extracting spatial and local temporal motion information, failing to capture long-range temporal representations. In this paper, we propose \textbf{CGTGait}, a novel framework that collaboratively integrates graph convolution and transformers to extract discriminative spatiotemporal features for gait emotion recognition. Specifically, CGTGait consists of multiple CGT blocks, where each block employs graph convolution to capture frame-level spatial topology and the transformer to model global temporal dependencies. Additionally, we introduce a Bidirectional Cross-Stream Fusion (BCSF) module to effectively aggregate posture and motion spatiotemporal features, facilitating the exchange of complementary information between the two streams. We evaluate our method on two widely used datasets, Emotion-Gait and ELMD, demonstrating that our CGTGait achieves state-of-the-art or at least competitive performance while reducing computational complexity by approximately \textbf{82.2\%} (only requiring 0.34G FLOPs) during testing. Code is available at \small{https://github.com/githubzjj1/CGTGait.}
中文: CGTGait框架结合图卷积与Transformer提取步态情感识别的时空特征,在显著降低82.2%计算量的同时实现了最优性能。
English: The proposed CGTGait framework integrates graph convolution and transformers to capture spatiotemporal features for gait emotion recognition, achieving state-of-the-art performance while reducing computational complexity by 82.2%.

Authors:Shipeng Liu, Zhonglin Zhang, Dengfeng Chen, Liang Zhao
Title: Describe-to-Score: Text-Guided Efficient Image Complexity Assessment
Abstract:
Accurately assessing image complexity (IC) is critical for computer vision, yet most existing methods rely solely on visual features and often neglect high-level semantic information, limiting their accuracy and generalization. We introduce vision-text fusion for IC modeling. This approach integrates visual and textual semantic features, increasing representational diversity. It also reduces the complexity of the hypothesis space, which enhances both accuracy and generalization in complexity assessment. We propose the D2S (Describe-to-Score) framework, which generates image captions with a pre-trained vision-language model. We propose the feature alignment and entropy distribution alignment mechanisms, D2S guides semantic information to inform complexity assessment while bridging the gap between vision and text modalities. D2S utilizes multi-modal information during training but requires only the vision branch during inference, thereby avoiding multi-modal computational overhead and enabling efficient assessment. Experimental results demonstrate that D2S outperforms existing methods on the IC9600 dataset and maintains competitiveness on no-reference image quality assessment (NR-IQA) benchmark, validating the effectiveness and efficiency of multi-modal fusion in complexity-related tasks. Code is available at: https://github.com/xauat-liushipeng/D2S
中文摘要:D2S框架通过视觉-文本融合整合视觉与语义特征,在提升图像复杂度评估准确性和泛化能力的同时,保持了推理阶段的高效性。
English Summary: The D2S framework enhances image complexity assessment by integrating visual and textual semantic features through vision-text fusion, improving accuracy and generalization while maintaining computational efficiency during inference.

Authors:Minji Heo, Simon S. Woo
Title: FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection
Abstract:
Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as Face-Swapping, GAN-based generation, and Diffusion methods, can pose an emerging and unforseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated single manipulation, little is known about the detection model behavior under such compositional, hybrid, and complex manipulation pipelines. In this work, we introduce \textbf{FakeChain}, a large-scale benchmark comprising 1-, 2-, and 3-Step forgeries synthesized using five state-of-the-art representative generators. Using this approach, we analyze detection performance and spectral properties across hybrid manipulation at different step, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance highly depends on the final manipulation type, with F1-score dropping by up to \textbf{58.83\%} when it differs from training distribution. This clearly demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, limiting generalization. Such findings highlight the need for detection models to explicitly consider manipulation history and sequences. Our results highlight the importance of benchmarks such as FakeChain, reflecting growing synthesis complexity and diversity in real-world scenarios. Our sample code is available here\footnote{https://github.com/minjihh/FakeChain}.
中文: 通过组合不同生成方法创建的多步骤混合深度伪造对检测模型构成重大挑战,这些模型因依赖最终阶段痕迹而非累积操作特征而难以泛化。
English: Multi-step hybrid deepfakes created by combining different generation methods present significant challenges to detection models, which often fail to generalize due to reliance on final-stage artifacts rather than cumulative manipulation traces.

Authors:Minji Heo, Simon S. Woo
Title: FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection
Abstract:
Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as Face-Swapping, GAN-based generation, and Diffusion methods, can pose an emerging and unforseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated single manipulation, little is known about the detection model behavior under such compositional, hybrid, and complex manipulation pipelines. In this work, we introduce \textbf{FakeChain}, a large-scale benchmark comprising 1-, 2-, and 3-Step forgeries synthesized using five state-of-the-art representative generators. Using this approach, we analyze detection performance and spectral properties across hybrid manipulation at different step, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance highly depends on the final manipulation type, with F1-score dropping by up to \textbf{58.83\%} when it differs from training distribution. This clearly demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, limiting generalization. Such findings highlight the need for detection models to explicitly consider manipulation history and sequences. Our results highlight the importance of benchmarks such as FakeChain, reflecting growing synthesis complexity and diversity in real-world scenarios. Our sample code is available here\footnote{https://github.com/minjihh/FakeChain}.
中文: 通过组合不同生成方法创建的多步骤混合深度伪造对检测模型构成重大挑战,这些模型因依赖最终阶段痕迹而非累积操作特征而难以泛化。
English: Multi-step hybrid deepfakes created by combining different generation methods present significant challenges to detection models, which often fail to generalize due to reliance on final-stage artifacts rather than cumulative manipulation traces.

Authors:Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang
Title: From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Token's Nature
Abstract:
Reinforcement Learning has emerged as the fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average that normalizes advantages at token level, jointly accounting for sequence-length as in token-mean loss while preserving non-biased treatment. We then develop Differential Advantage Redistribution that leverages entropy and importance ratios to modulate rewards-adjusting updates for tokens with clear signals. For clipping loss, we design Asymmetric Adaptive Clipping, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through systematic investigation between entropy and training dynamics, we embedded token-level treatment into every stages to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code can be found in https://github.com/starriver030515/HAPO.
中文摘要: 本文提出HAPO算法,通过基于令牌熵的自适应温度采样和非对称剪裁等创新组件,在强化学习中实现细粒度优化,在不同规模模型上均优于现有方法。
English Summary: This paper introduces HAPO, a token-aware reinforcement learning algorithm that dynamically adapts optimization based on token entropy through novel components including adaptive temperature sampling and asymmetric clipping, consistently outperforming existing methods across model scales.

Authors:Antonio Scardace, Lemuel Puglisi, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì
Title: A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis
Abstract:
Deep generative models have emerged as a transformative tool in medical imaging, offering substantial potential for synthetic data generation. However, recent empirical studies highlight a critical vulnerability: these models can memorize sensitive training data, posing significant risks of unauthorized patient information disclosure. Detecting memorization in generative models remains particularly challenging, necessitating scalable methods capable of identifying training data leakage across large sets of generated samples. In this work, we propose DeepSSIM, a novel self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to: i) project images into a learned embedding space and ii) force the cosine similarity between embeddings to match the ground-truth SSIM (Structural Similarity Index) scores computed in the image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, allowing DeepSSIM to estimate similarity reliably without requiring precise spatial alignment. We evaluate DeepSSIM in a case study involving synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions, using 2,195 MRI scans from two publicly available datasets (IXI and CoRR). Compared to state-of-the-art memorization metrics, DeepSSIM achieves superior performance, improving F1 scores by an average of +52.03% over the best existing method. Code and data of our approach are publicly available at the following link: https://github.com/brAIn-science/DeepSSIM.
Chinese: DeepSSIM是一种新颖的自监督度量方法,通过将图像投影至与结构相似性对齐的嵌入空间,有效量化医学影像生成模型中的记忆效应,在检测训练数据泄露方面展现出优于现有方法的性能。
English: DeepSSIM is a novel self-supervised metric that effectively quantifies memorization in medical imaging generative models by projecting images into an embedding space aligned with structural similarity, demonstrating superior performance over existing methods in detecting training data leakage.

Authors:Jun Rong Brian Chong, Yixuan Tang, Anthony K. H. Tung
Title: MPCG: Multi-Round Persona-Conditioned Generation for Modeling the Evolution of Misinformation with LLMs
Abstract:
Misinformation evolves as it spreads, shifting in language, framing, and moral emphasis to adapt to new audiences. However, current misinformation detection approaches implicitly assume that misinformation is static. We introduce MPCG, a multi-round, persona-conditioned framework that simulates how claims are iteratively reinterpreted by agents with distinct ideological perspectives. Our approach uses an uncensored large language model (LLM) to generate persona-specific claims across multiple rounds, conditioning each generation on outputs from the previous round, enabling the study of misinformation evolution. We evaluate the generated claims through human and LLM-based annotations, cognitive effort metrics (readability, perplexity), emotion evocation metrics (sentiment analysis, morality), clustering, feasibility, and downstream classification. Results show strong agreement between human and GPT-4o-mini annotations, with higher divergence in fluency judgments. Generated claims require greater cognitive effort than the original claims and consistently reflect persona-aligned emotional and moral framing. Clustering and cosine similarity analyses confirm semantic drift across rounds while preserving topical coherence. Feasibility results show a 77% feasibility rate, confirming suitability for downstream tasks. Classification results reveal that commonly used misinformation detectors experience macro-F1 performance drops of up to 49.7%. The code is available at https://github.com/bcjr1997/MPCG
中文摘要:MPCG框架通过多轮人物角色条件化重构模拟虚假信息的演变过程,揭示现有检测器对动态适配的虚假信息存在高达49.7%的性能下降。
English Summary: The MPCG framework simulates misinformation evolution through persona-conditioned reinterpretation across multiple rounds, demonstrating that current detectors fail against dynamically adapted claims with significant performance drops.

Authors:Ji Soo Lee, Byungoh Ko, Jaewon Cho, Howoong Lee, Jaewoon Byun, Hyunwoo J. Kim
Title: Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization
Abstract:
In text-video retrieval, auxiliary captions are often used to enhance video understanding, bridging the gap between the modalities. While recent advances in multi-modal large language models (MLLMs) have enabled strong zero-shot caption generation, we observe that such captions tend to be generic and indistinguishable across visually similar videos, limiting their utility for fine-grained retrieval. Moreover, conventional captioning approaches are typically evaluated using language generation metrics, such as BLEU, which are not typically tailored for retrieval tasks that require making discriminative distinctions between candidates. To address this, we propose $\textbf{CaRe-DPO}$, a retrieval framework that directly optimizes caption generation using retrieval relevance scores. At its core is Dual-Group Direct Preference Optimization (DG-DPO), a novel learning strategy that supervises captioning by modeling preferences across groups of distinct video and caption pairs. In addition, we present an MLLM-based retrieval model that incorporates role-embeddings to better distinguish between textual inputs with different functional roles, such as an auxiliary caption and a text query. Through extensive experiments, we demonstrate that CaRe-DPO significantly enhances retrieval performance by effectively leveraging auxiliary knowledge to generate fine-grained captions for retrieval. Code is available at https://github.com/mlvlab/CaReDPO.
中文: 提出的CaRe-DPO框架通过检索相关性分数直接优化字幕生成,并采用跨视频-字幕对建模偏好的新颖学习策略,有效提升细粒度文本-视频检索性能。
English: The proposed CaRe-DPO framework enhances text-video retrieval by directly optimizing caption generation through retrieval relevance scores and a novel learning strategy that models preferences across video-caption pairs, significantly improving fine-grained retrieval performance.

Authors:Zirui Wang, Jiayi Zhang, Tianwei Guan, Yuhan Zhou, Xingyuan Li, Minjing Dong, Jinyuan Liu
Title: Efficient Rectified Flow for Image Fusion
Abstract:
Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities to fuse images. In recent years, diffusion models have made significant developments in the field of image fusion. However, diffusion models often require complex computations and redundant inference time, which reduces the applicability of these methods. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality. Code is available at https://github.com/zirui0625/RFfusion.
中文: RFfusion 是一种高效的图像融合一步扩散模型,通过结合整流流和任务特定的变分自编码器,在保持高质量融合效果的同时显著提升了推理速度。
English: RFfusion is an efficient one-step diffusion model for image fusion that integrates Rectified Flow and a task-specific VAE to enhance inference speed while maintaining high-quality results.

Authors:Burak Satar, Zhixin Ma, Patrick A. Irawan, Wilfried A. Mulyawan, Jing Jiang, Ee-Peng Lim, Chong-Wah Ngo
Title: Seeing Culture: A Benchmark for Visual Reasoning and Grounding
Abstract:
Multimodal vision-language models (VLMs) have made substantial progress in various tasks that require a combined understanding of visual and textual content, particularly in cultural understanding tasks, with the emergence of new cultural datasets. However, these datasets frequently fall short of providing cultural reasoning while underrepresenting many cultures. In this paper, we introduce the Seeing Culture Benchmark (SCB), focusing on cultural reasoning with a novel approach that requires VLMs to reason on culturally rich images in two stages: i) selecting the correct visual option with multiple-choice visual question answering (VQA), and ii) segmenting the relevant cultural artifact as evidence of reasoning. Visual options in the first stage are systematically organized into three types: those originating from the same country, those from different countries, or a mixed group. Notably, all options are derived from a singular category for each type. Progression to the second stage occurs only after a correct visual option is chosen. The SCB benchmark comprises 1,065 images that capture 138 cultural artifacts across five categories from seven Southeast Asia countries, whose diverse cultures are often overlooked, accompanied by 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. Our evaluation of various VLMs reveals the complexities involved in cross-modal cultural reasoning and highlights the disparity between visual reasoning and spatial grounding in culturally nuanced scenarios. The SCB serves as a crucial benchmark for identifying these shortcomings, thereby guiding future developments in the field of cultural reasoning. https://github.com/buraksatar/SeeingCulture
中文摘要:Seeing Culture Benchmark(SCB)通过两阶段评估方法,要求视觉语言模型先回答文化选择题再分割相关文物,利用1065张东南亚多元文化图像解决了现有数据集文化推理能力不足的问题。
English Summary: The Seeing Culture Benchmark (SCB) introduces a two-stage evaluation method requiring vision-language models to first answer cultural multiple-choice questions and then segment relevant artifacts, addressing the lack of cultural reasoning in existing datasets through 1,065 culturally diverse Southeast Asian images.

Authors:Haijin Zeng, Xuan Lu, Yurong Zhang, Yongyong Chen, Jingyong Su, Jie Liu
Title: SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging
Abstract:
Humans learn in two complementary ways: a slow, cumulative process that builds broad, general knowledge, and a fast, on-the-fly process that captures specific experiences. Existing deep-unfolding methods for spectral compressive imaging (SCI) mirror only the slow component-relying on heavy pre-training with many unfolding stages-yet they lack the rapid adaptation needed to handle new optical configurations. As a result, they falter on out-of-distribution cameras, especially in bespoke spectral setups unseen during training. This depth also incurs heavy computation and slow inference. To bridge this gap, we introduce SlowFast-SCI, a dual-speed framework seamlessly integrated into any deep unfolding network beyond SCI systems. During slow learning, we pre-train or reuse a priors-based backbone and distill it via imaging guidance into a compact fast-unfolding model. In the fast learning stage, lightweight adaptation modules are embedded within each block and trained self-supervised at test time via a dual-domain loss-without retraining the backbone. To the best of our knowledge, SlowFast-SCI is the first test-time adaptation-driven deep unfolding framework for efficient, self-adaptive spectral reconstruction. Its dual-stage design unites offline robustness with on-the-fly per-sample calibration-yielding over 70% reduction in parameters and FLOPs, up to 5.79 dB PSNR improvement on out-of-distribution data, preserved cross-domain adaptability, and a 4x faster adaptation speed. In addition, its modularity integrates with any deep-unfolding network, paving the way for self-adaptive, field-deployable imaging and expanded computational imaging modalities. Code and models are available at https://github.com/XuanLu11/SlowFast-SCI.
中文:SlowFast-SCI提出了一种双速框架,将预训练的鲁棒性与轻量级测试时自适应相结合,在分布外光谱成像数据上实现了显著效率提升和性能改进。
English: SlowFast-SCI introduces a dual-speed framework that combines pre-trained robustness with lightweight test-time adaptation, achieving significant efficiency gains and improved performance on out-of-distribution spectral imaging data.

Authors:Joe Barrow
Title: CommonForms: A Large, Diverse Dataset for Form Field Detection
Abstract:
This paper introduces CommonForms, a web-scale dataset for form field detection. It casts the problem of form field detection as object detection: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of form fields. The dataset is constructed by filtering Common Crawl to find PDFs that have fillable elements. Starting with 8 million documents, the filtering process is used to arrive at a final dataset of roughly 55k documents that have over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains; one third of the pages are non-English, and among the 14 classified domains, no domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than $500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency over using all PDFs that have fillable fields in Common Crawl. A qualitative analysis shows that they outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large scale dataset released for form field detection, as well as the first open source models. The dataset, models, and code will be released at https://github.com/jbarrow/commonforms
中文: 本文提出了基于网络PDF构建的大规模表单字段检测数据集CommonForms,并开发了FFDNet模型系列,以低于500美元的训练成本实现高精度检测,其性能优于商业解决方案。
English: This paper introduces CommonForms, a large-scale dataset for form field detection built from web PDFs, and presents FFDNet models that achieve high precision at low cost, outperforming commercial solutions.

Authors:Dev Gurung, Shiva Raj Pokhrel
Title: sat-QFL: Secure Quantum Federated Learning for Low Orbit Satellites
Abstract:
Low Earth orbit (LEO) constellations violate core assumptions of standard (quantum) federated learning (FL): client-server connectivity is intermittent, participation is time varying, and latency budgets are strict. We present sat-QFL, a hierarchical, access aware quantum federated learning (QFL) framework that partitions satellites into primary (ground connected) and secondary as inter-satellite links (ISL-only) roles, and schedules sequential, simultaneous, or asynchronous edge training aligned with visibility windows. For quantum-resilient confidentiality and integrity, sat-QFL integrates quantum key distribution (QKD) based key establishment with authenticated encryption for model exchange; we also assess teleportation as a feasibility primitive for quantum state transfer. Using derived constellation traces and QFL workloads (Qiskit), we show that sat-QFL sustains robust aggregation under varying participation and reduces communication bottlenecks with modest security overhead. Our implementation and results are available at https://github.com/s222416822/satQFL.
中文:sat-QFL框架通过分层组织卫星和调度适应性训练模式,解决了低轨卫星星座中间歇性连接和严格延迟的挑战,同时集成量子安全措施以最小开销确保通信保密性。
English: The sat-QFL framework addresses the challenges of intermittent connectivity and strict latency in LEO satellite constellations by hierarchically organizing satellites and scheduling adaptable training modes, while integrating quantum security measures to ensure confidentiality with minimal overhead.

Authors:Mohamed Eltahir, Osamah Sarraj, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammed Khurd, Mohammed Bremoo, Tanveer Hussain
Title: AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks
Abstract:
Video-to-text and text-to-video retrieval are dominated by English benchmarks (e.g. DiDeMo, MSR-VTT) and recent multilingual corpora (e.g. RUDDER), yet Arabic remains underserved, lacking localized evaluation metrics. We introduce a three-stage framework, AutoArabic, utilizing state-of-the-art large language models (LLMs) to translate non-Arabic benchmarks into Modern Standard Arabic, reducing the manual revision required by nearly fourfold. The framework incorporates an error detection module that automatically flags potential translation errors with 97% accuracy. Applying the framework to DiDeMo, a video retrieval benchmark produces DiDeMo-AR, an Arabic variant with 40,144 fluent Arabic descriptions. An analysis of the translation errors is provided and organized into an insightful taxonomy to guide future Arabic localization efforts. We train a CLIP-style baseline with identical hyperparameters on the Arabic and English variants of the benchmark, finding a moderate performance gap (about 3 percentage points at Recall@1), indicating that Arabic localization preserves benchmark difficulty. We evaluate three post-editing budgets (zero/ flagged-only/ full) and find that performance improves monotonically with more post-editing, while the raw LLM output (zero-budget) remains usable. To ensure reproducibility to other languages, we made the code available at https://github.com/Tahaalshatiri/AutoArabic.
中文:AutoArabic框架利用先进的大型语言模型将视频文本基准自动翻译成阿拉伯语,准确率高,创建了如DiDeMo-AR等本地化数据集,在保持基准难度的同时显著减少了人工编辑需求。
English: The AutoArabic framework uses advanced LLMs to automatically translate video-text benchmarks into Arabic with high accuracy, creating localized datasets like DiDeMo-AR that preserve benchmark difficulty while significantly reducing manual editing needs.

Authors:Zhengri Wu, Yiran Wang, Yu Wen, Zeyu Zhang, Biao Wu, Hao Tang
Title: StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes
Abstract:
Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping, offering metric depth from low-cost passive cameras while avoiding the scale ambiguity of monocular methods. However, existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. To address these challenges, we propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module. We further introduce dynamic LoRA adaptation for efficient rank selection and pre-training on the synthetic UW-StereoDepth-40K dataset to enhance robustness under diverse underwater conditions. Comprehensive evaluations on both simulated and real-world benchmarks show improvements of 6.11% on TartanAir and 5.12% on SQUID compared to state-of-the-art methods, while real-world deployment with the BlueROV2 robot further demonstrates the consistent robustness of our approach. Code: https://github.com/AIGeeksGroup/StereoAdapter. Website: https://aigeeksgroup.github.io/StereoAdapter.
中文: StereoAdapter是一种参数高效的自监督框架,通过结合LoRA适配的单目基础编码器和立体细化模块,在多种水下环境中提升了深度估计的鲁棒性并实现了最先进的性能。
English: StereoAdapter is a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with stereo refinement, achieving state-of-the-art underwater depth estimation performance while enhancing robustness across diverse conditions.

Authors:Francesco Argenziano, Miguel Saavedra-Ruiz, Sacha Morin, Daniele Nardi, Liam Paull
Title: Dynamic Objects Relocalization in Changing Environments with Flow Matching
Abstract:
Task and motion planning are long-standing challenges in robotics, especially when robots have to deal with dynamic environments exhibiting long-term dynamics, such as households or warehouses. In these environments, long-term dynamics mostly stem from human activities, since previously detected objects can be moved or removed from the scene. This adds the necessity to find such objects again before completing the designed task, increasing the risk of failure due to missed relocalizations. However, in these settings, the nature of such human-object interactions is often overlooked, despite being governed by common habits and repetitive patterns. Our conjecture is that these cues can be exploited to recover the most likely objects' positions in the scene, helping to address the problem of unknown relocalization in changing environments. To this end we propose FlowMaps, a model based on Flow Matching that is able to infer multimodal object locations over space and time. Our results present statistical evidence to support our hypotheses, opening the way to more complex applications of our approach. The code is publically available at https://github.com/Fra-Tsuna/flowmaps
中文摘要:在家庭等动态环境中,任务与运动规划因人为移动物体而面临挑战,但FlowMaps通过分析人类交互模式来预测物体最可能出现的位置,有效解决了物体重定位问题。
English Summary: Task and motion planning in dynamic environments like households is challenging due to human-induced object movements, but FlowMaps addresses this by predicting likely object positions using human interaction patterns.

Authors:Josias K. Moukpe, Philip K. Chan, Ming Zhang
Title: Highly Imbalanced Regression with Tabular Data in SEP and Other Applications
Abstract:
We investigate imbalanced regression with tabular data that have an imbalance ratio larger than 1,000 ("highly imbalanced"). Accurately estimating the target values of rare instances is important in applications such as forecasting the intensity of rare harmful Solar Energetic Particle (SEP) events. For regression, the MSE loss does not consider the correlation between predicted and actual values. Typical inverse importance functions allow only convex functions. Uniform sampling might yield mini-batches that do not have rare instances. We propose CISIR that incorporates correlation, Monotonically Decreasing Involution (MDI) importance, and stratified sampling. Based on five datasets, our experimental results indicate that CISIR can achieve lower error and higher correlation than some recent methods. Also, adding our correlation component to other recent methods can improve their performance. Lastly, MDI importance can outperform other importance functions. Our code can be found in https://github.com/Machine-Earning/CISIR.
中文: 本研究提出CISIR方法,针对高度不平衡的回归问题,通过结合相关性分析、单调递减对合重要性函数和分层抽样,在多个数据集上实现了比现有方法更低的误差和更高的相关性。
English: The study introduces CISIR, a novel method for highly imbalanced regression that integrates correlation, monotonically decreasing involution importance, and stratified sampling, demonstrating superior performance with lower error and higher correlation compared to existing approaches on multiple datasets.

Authors:Yunsoo Kim, Michal W. S. Ong, Alex Shavick, Honghan Wu, Adam P. Levine
Title: HARE: an entity and relation centric evaluation framework for histopathology reports
Abstract:
Medical domain automated text generation is an active area of research and development; however, evaluating the clinical quality of generated reports remains a challenge, especially in instances where domain-specific metrics are lacking, e.g. histopathology. We propose HARE (Histopathology Automated Report Evaluation), a novel entity and relation centric framework, composed of a benchmark dataset, a named entity recognition (NER) model, a relation extraction (RE) model, and a novel metric, which prioritizes clinically relevant content by aligning critical histopathology entities and relations between reference and generated reports. To develop the HARE benchmark, we annotated 813 de-identified clinical diagnostic histopathology reports and 652 histopathology reports from The Cancer Genome Atlas (TCGA) with domain-specific entities and relations. We fine-tuned GatorTronS, a domain-adapted language model to develop HARE-NER and HARE-RE which achieved the highest overall F1-score (0.915) among the tested models. The proposed HARE metric outperformed traditional metrics including ROUGE and Meteor, as well as radiology metrics such as RadGraph-XL, with the highest correlation and the best regression to expert evaluations (higher than the second best method, GREEN, a large language model based radiology report evaluator, by Pearson $r = 0.168$, Spearman $ρ= 0.161$, Kendall $τ= 0.123$, $R^2 = 0.176$, $RMSE = 0.018$). We release HARE, datasets, and the models at https://github.com/knowlab/HARE to foster advancements in histopathology report generation, providing a robust framework for improving the quality of reports.
中文: HARE框架提出了一种新颖的基于实体和关系的组织病理学报告评估方法,通过关注临床相关内容,在专家评估相关性上显著优于现有指标。
English: The HARE framework introduces a novel entity and relation-based evaluation system for histopathology report generation, outperforming existing metrics by prioritizing clinically relevant content and demonstrating superior correlation with expert assessments.

Authors:Karan Kendre
Title: Machine Learning for Quantum Noise Reduction
Abstract:
Quantum noise fundamentally limits the utility of near-term quantum devices, making error mitigation essential for practical quantum computation. While traditional quantum error correction codes require substantial qubit overhead and complex syndrome decoding, we propose a machine learning approach that directly reconstructs clean quantum states from noisy density matrices without additional qubits. We formulate quantum noise reduction as a supervised learning problem using a convolutional neural network (CNN) autoencoder architecture with a novel fidelity-aware composite loss function. Our method is trained and evaluated on a comprehensive synthetic dataset of 10,000 density matrices derived from random 5-qubit quantum circuits, encompassing five noise types (depolarizing, amplitude damping, phase damping, bit-flip, and mixed noise) across four intensity levels (0.05-0.20). The CNN successfully reconstructs quantum states across all noise conditions, achieving an average fidelity improvement from 0.298 to 0.774 (Δ = 0.476). Notably, the model demonstrates superior performance on complex mixed noise scenarios and higher noise intensities, with mixed noise showing the highest corrected fidelity (0.807) and improvement (0.567). The approach effectively preserves both diagonal elements (populations) and off-diagonal elements (quantum coherences), making it suitable for entanglement-dependent quantum algorithms. While phase damping presents fundamental information-theoretic limitations, our results suggest that CNN-based density matrix reconstruction offers a promising, resource-efficient alternative to traditional quantum error correction for NISQ-era devices. This data-driven approach could enable practical quantum advantage with fewer physical qubits than conventional error correction schemes require.
中文摘要:本研究提出一种基于卷积神经网络自编码器的机器学习方法,可直接从含噪密度矩阵重构纯净量子态,无需额外量子比特即可在各种噪声类型下实现显著保真度提升,为近期量子设备提供了比传统纠错方案更资源高效的解决方案。
English Summary: This study introduces a machine learning method using a CNN autoencoder to directly reconstruct clean quantum states from noisy density matrices, achieving significant fidelity improvements across various noise types without requiring additional qubits, offering a resource-efficient alternative to traditional quantum error correction for near-term devices.

Authors:Huaiyu Chen, Fahed Hassanat, Robert Laganiere, Martin Bouchard
Title: mRadNet: A Compact Radar Object Detector with MetaFormer
Abstract:
Frequency-modulated continuous wave radars have gained increasing popularity in the automotive industry. Its robustness against adverse weather conditions makes it a suitable choice for radar object detection in advanced driver assistance systems. These real-time embedded systems have requirements for the compactness and efficiency of the model, which have been largely overlooked in previous work. In this work, we propose mRadNet, a novel radar object detection model with compactness in mind. mRadNet employs a U-net style architecture with MetaFormer blocks, in which separable convolution and attention token mixers are used to capture both local and global features effectively. More efficient token embedding and merging strategies are introduced to further facilitate the lightweight design. The performance of mRadNet is validated on the CRUW dataset, improving state-of-the-art performance with the least number of parameters and FLOPs.
Chinese: 提出的mRadNet模型采用U-net架构与MetaFormer模块,实现了紧凑高效的雷达目标检测系统,在CRUW数据集上以最少的参数和计算量取得了领先的性能。
English: The proposed mRadNet model introduces a compact and efficient radar object detection system using a U-net architecture with MetaFormer blocks, achieving state-of-the-art performance on the CRUW dataset with minimal parameters and computational cost.

Authors:Juhani Merilehto
Title: A 200-Line Python Micro-Benchmark Suite for NISQ Circuit Compilers
Abstract:
We present microbench.py, a compact (approx. 200 lines) Python script that automates the collection of key compiler metrics, i.e., gate depth, two-qubit-gate count, wall-clock compilation time, and memory footprint, across multiple open-source quantum circuit transpilers. The suite ships with six didactic circuits (3 to 8 qubits) implementing fundamental quantum algorithms and supports Qiskit, tket, Cirq, and the Qiskit-Braket provider; in this paper we showcase results for Qiskit 0.46 and Braket 1.16. The entire run completes in under three minutes on a laptop, emits a single CSV plus publisheable plot, and reproduces the figure here with one command. We release the code under the MIT licence to serve as a quick-start regression harness for NISQ compiler research.
Chinese: microbench.py脚本可自动收集量子电路转换器的关键编译器指标,支持多种平台,在笔记本电脑上三分钟内即可完成分析。
English: The microbench.py script automates the collection of essential compiler metrics for quantum circuit transpilers, supporting multiple platforms and completing analysis in under three minutes on a laptop.

Authors:Luca Della Libera, Cem Subakan, Mirco Ravanelli
Title: FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation
Abstract:
Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications. We present FocalCodec-Stream, a hybrid codec based on focal modulation that compresses speech into a single binary codebook at 0.55 - 0.80 kbps with a theoretical latency of 80 ms. Our approach combines multi-stage causal distillation of WavLM with targeted architectural improvements, including a lightweight refiner module that enhances quality under latency constraints. Experiments show that FocalCodec-Stream outperforms existing streamable codecs at comparable bitrates, while preserving both semantic and acoustic information. The result is a favorable trade-off between reconstruction quality, downstream task performance, latency, and efficiency. Code and checkpoints will be released at https://github.com/lucadellalib/focalcodec.
中文: FocalCodec-Stream是一种基于焦点调制的新型混合神经音频编解码器,在低比特率和低延迟条件下实现了卓越的语音压缩性能,在保持高质量重建和效率的同时超越了现有可流式编解码器。
English: FocalCodec-Stream is a novel hybrid neural audio codec that achieves superior low-bitrate speech compression with minimal latency, outperforming existing streamable codecs while maintaining high reconstruction quality and efficiency.

Authors:Jinghao Zhang, Sihang Jiang, Shiwei Guo, Shisong Chen, Yanghua Xiao, Hongwei Feng, Jiaqing Liang, Minggui HE, Shimin Tao, Hongxia Ma
Title: CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs
Abstract:
As large language models (LLMs) are increasingly deployed in diverse cultural environments, evaluating their cultural understanding capability has become essential for ensuring trustworthy and culturally aligned applications. However, most existing benchmarks lack comprehensiveness and are challenging to scale and adapt across different cultural contexts, because their frameworks often lack guidance from well-established cultural theories and tend to rely on expert-driven manual annotations. To address these issues, we propose CultureScope, the most comprehensive evaluation framework to date for assessing cultural understanding in LLMs. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification, comprising 3 layers and 140 dimensions, which guides the automated construction of culture-specific knowledge bases and corresponding evaluation datasets for any given languages and cultures. Experimental results demonstrate that our method can effectively evaluate cultural understanding. They also reveal that existing large language models lack comprehensive cultural competence, and merely incorporating multilingual data does not necessarily enhance cultural understanding. All code and data files are available at https://github.com/HoganZinger/Culture
中文摘要:CultureScope基于文化冰山理论提出全面评估框架,通过自动化构建文化知识库来测评大语言模型的文化理解能力,发现现有模型即使具备多语言数据仍存在文化认知缺陷。
English Summary: CultureScope introduces a comprehensive framework based on cultural iceberg theory to evaluate LLMs' cultural understanding through automated knowledge base construction, revealing current models' cultural competence gaps despite multilingual training.

Authors:Xiaoqi Zhao, Youwei Pang, Chenyang Yu, Lihe Zhang, Huchuan Lu, Shijian Lu, Georges El Fakhri, Xiaofeng Liu
Title: UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation
Abstract:
Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance. While existing methods address training-inference modality gaps via specialized per-combination models, they introduce high deployment costs by requiring exhaustive model subsets and model-modality matching. In this work, we propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature and output levels. % First, we adopt modality reconstruction with the hybrid shuffled-masking augmentation, encouraging the model to learn the intrinsic modality characteristics and generate meaningful representations for missing modalities through cross-modal fusion. % Next, modality-invariant contrastive learning implicitly compensates the feature space distance among incomplete-complete modality pairs. Furthermore, the proposed lightweight reverse attention adapter explicitly compensates for the weak perceptual semantics in the frozen encoder. Last, UniMRSeg is fine-tuned under the hybrid consistency constraint to ensure stable prediction under all modality combinations without large performance fluctuations. Without bells and whistles, UniMRSeg significantly outperforms the state-of-the-art methods under diverse missing modality scenarios on MRI-based brain tumor segmentation, RGB-D semantic segmentation, RGB-D/T salient object segmentation. The code will be released at https://github.com/Xiaoqi-Zhao-DLUT/UniMRSeg.
Chinese: 本文提出UniMRSeg统一模态松弛分割网络,通过分层自监督补偿方法解决模态缺失导致的性能下降问题,在多种分割任务中均取得了最优性能。
English: This paper introduces UniMRSeg, a unified modality-relax segmentation network that addresses performance degradation from incomplete modalities through hierarchical self-supervised compensation, achieving state-of-the-art results across multiple segmentation tasks.

Authors:Sheng Zhang, Yifan Ding, Shuquan Lian, Shun Song, Hui Li
Title: CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion
Abstract:
Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository. Recent strides in Code Large Language Models (code LLMs) have spurred the development of repository-level code completion methods, yielding promising results. Nevertheless, they suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between code retriever and code LLM. To address these problems, we introduce CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. Its core components include log probability guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval demonstrate that CodeRAG significantly and consistently outperforms state-of-the-art methods. The implementation of CodeRAG is available at https://github.com/KDEGroup/CodeRAG.
Chinese Summary: CodeRAG是一种专为存储库级代码补全设计的框架,通过优化查询构建、多路径代码检索和偏好对齐重排序,有效解决了现有方法的不足,并在实验中显著超越了现有最优方法。
English Summary: CodeRAG is a novel framework that enhances repository-level code completion by addressing key limitations in query construction, retrieval methods, and model alignment through innovative techniques like log probability guided queries and multi-path retrieval.

Authors:Shen Cheng, Haipeng Li, Haibin Huang, Xiaohong Liu, Shuaicheng Liu
Title: Blind-Spot Guided Diffusion for Self-supervised Real-World Denoising
Abstract:
In this work, we present Blind-Spot Guided Diffusion, a novel self-supervised framework for real-world image denoising. Our approach addresses two major challenges: the limitations of blind-spot networks (BSNs), which often sacrifice local detail and introduce pixel discontinuities due to spatial independence assumptions, and the difficulty of adapting diffusion models to self-supervised denoising. We propose a dual-branch diffusion framework that combines a BSN-based diffusion branch, generating semi-clean images, with a conventional diffusion branch that captures underlying noise distributions. To enable effective training without paired data, we use the BSN-based branch to guide the sampling process, capturing noise structure while preserving local details. Extensive experiments on the SIDD and DND datasets demonstrate state-of-the-art performance, establishing our method as a highly effective self-supervised solution for real-world denoising. Code and pre-trained models are released at: https://github.com/Sumching/BSGD.
中文: 本文提出盲点引导扩散方法,通过双分支框架结合盲点网络和扩散模型解决自监督图像去噪难题,在标准数据集上取得了领先性能。
English: This paper introduces Blind-Spot Guided Diffusion, a self-supervised dual-branch framework that overcomes blind-spot network limitations and adapts diffusion models for real-world image denoising, achieving state-of-the-art results on benchmark datasets.

Authors:Maithili Joshi, Palash Nandi, Tanmoy Chakraborty
Title: SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection
Abstract:
Large Language Models (LLMs) with safe-alignment training are powerful instruments with robust language comprehension capabilities. These models typically undergo meticulous alignment procedures involving human feedback to ensure the acceptance of safe inputs while rejecting harmful or unsafe ones. However, despite their massive scale and alignment efforts, LLMs remain vulnerable to jailbreak attacks, where malicious users manipulate the model to produce harmful outputs that it was explicitly trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly embedded in the middle-to-late layers. Building on this insight, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), which connects two intermediate layers $s$ and $e$ such that $s < e$, through a residual connection. Our approach achieves a 51% improvement over the best-performing baseline on the HarmBench test set. Furthermore, SABER induces only a marginal shift in perplexity when evaluated on the HarmBench validation set. The source code is publicly available at https://github.com/PalGitts/SABER.
Chinese: 本研究揭示大型语言模型的安全机制主要存在于中后层,并提出SABER方法——通过中间层残差连接绕过安全对齐的白盒越狱技术,在保持性能的同时显著提升攻击成功率。
English: This study reveals that safety mechanisms in Large Language Models (LLMs) primarily reside in middle-to-late layers and introduces SABER, a white-box jailbreak method using residual connections between intermediate layers to bypass safety alignment with minimal performance impact.

Authors:Bhavesh Sandbhor, Bheeshm Sharma, Balamurugan Palaniappan
Title: SLaM-DiMM: Shared Latent Modeling for Diffusion Based Missing Modality Synthesis in MRI
Abstract:
Brain MRI scans are often found in four modalities, consisting of T1-weighted with and without contrast enhancement (T1ce and T1w), T2-weighted imaging (T2w), and Flair. Leveraging complementary information from these different modalities enables models to learn richer, more discriminative features for understanding brain anatomy, which could be used in downstream tasks such as anomaly detection. However, in clinical practice, not all MRI modalities are always available due to various reasons. This makes missing modality generation a critical challenge in medical image analysis. In this paper, we propose SLaM-DiMM, a novel missing modality generation framework that harnesses the power of diffusion models to synthesize any of the four target MRI modalities from other available modalities. Our approach not only generates high-fidelity images but also ensures structural coherence across the depth of the volume through a dedicated coherence enhancement mechanism. Qualitative and quantitative evaluations on the BraTS-Lighthouse-2025 Challenge dataset demonstrate the effectiveness of the proposed approach in synthesizing anatomically plausible and structurally consistent results. Code is available at https://github.com/BheeshmSharma/SLaM-DiMM-MICCAI-BraTS-Challenge-2025.
中文: 本文提出SLaM-DiMM框架,利用扩散模型从现有模态合成缺失的MRI模态,通过一致性增强机制确保结构连贯性和高保真度,并在BraTS-Lighthouse-2025数据集上验证了其有效性。
English: The paper introduces SLaM-DiMM, a diffusion-based framework that synthesizes missing MRI modalities from available ones while ensuring structural coherence and high fidelity, validated on the BraTS-Lighthouse-2025 dataset.

Authors:Yujie Zhu, Charles A. Hepburn, Matthew Thorpe, Giovanni Montana
Title: Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations
Abstract:
In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at https://github.com/YujieZhu7/SPReD.
中文: SPReD提出了一种新颖的强化学习框架,通过集成方法量化不确定性来动态平衡示范模仿与策略探索,采用连续且与不确定性成比例的正则化方法,在机器人任务中实现了显著的性能提升。
English: SPReD introduces a novel reinforcement learning framework that uses ensemble-based uncertainty quantification to dynamically balance imitation of demonstrations with policy exploration, achieving significant performance improvements in robotics tasks through continuous, uncertainty-proportional regularization.

Authors:Shiyu Fang, Yiming Cui, Haoyang Liang, Chen Lv, Peng Hang, Jian Sun
Title: CoReVLA: A Dual-Stage End-to-End Autonomous Driving Framework for Long-Tail Scenarios via Collect-and-Refine
Abstract:
Autonomous Driving (AD) systems have made notable progress, but their performance in long-tail, safety-critical scenarios remains limited. These rare cases contribute a disproportionate number of accidents. Vision-Language Action (VLA) models have strong reasoning abilities and offer a potential solution, but their effectiveness is limited by the lack of high-quality data and inefficient learning in such conditions. To address these challenges, we propose CoReVLA, a continual learning end-to-end autonomous driving framework that improves the performance in long-tail scenarios through a dual-stage process of data Collection and behavior Refinement. First, the model is jointly fine-tuned on a mixture of open-source driving QA datasets, allowing it to acquire a foundational understanding of driving scenarios. Next, CoReVLA is deployed within the Cave Automatic Virtual Environment (CAVE) simulation platform, where driver takeover data is collected from real-time interactions. Each takeover indicates a long-tail scenario that CoReVLA fails to handle reliably. Finally, the model is refined via Direct Preference Optimization (DPO), allowing it to learn directly from human preferences and thereby avoid reward hacking caused by manually designed rewards. Extensive open-loop and closed-loop experiments demonstrate that the proposed CoReVLA model can accurately perceive driving scenarios and make appropriate decisions. On the Bench2Drive benchmark, CoReVLA achieves a Driving Score (DS) of 72.18 and a Success Rate (SR) of 50%, outperforming state-of-the-art methods by 7.96 DS and 15% SR under long-tail, safety-critical scenarios. Furthermore, case studies demonstrate the model's ability to continually improve its performance in similar failure-prone scenarios by leveraging past takeover experiences. All codea and preprocessed datasets are available at: https://github.com/FanGShiYuu/CoReVLA
中文: 提出的CoReVLA框架通过数据收集和行为优化的双阶段持续学习过程,有效提升了自动驾驶在长尾场景中的性能,利用基于仿真的驾驶员接管数据和直接偏好优化方法,在基准测试中取得了优于现有技术的表现。
English: The proposed CoReVLA framework enhances autonomous driving performance in long-tail scenarios through a dual-stage continual learning process of data collection and behavior refinement, achieving superior results on benchmarks by leveraging simulation-based driver takeovers and direct preference optimization.

Authors:Chao Yu, Yuanqing Wang, Zhen Guo, Hao Lin, Si Xu, Hongzhi Zang, Quanlu Zhang, Yongji Wu, Chunyang Zhu, Junhao Hu, Zixiao Huang, Mingjie Wei, Yuqing Xie, Ke Yang, Bo Dai, Zhexuan Xu, Xiangyuan Wang, Xu Fu, Zhihao Liu, Kang Chen, Weilin Liu, Gang Liu, Boxun Li, Jianlei Yang, Zhi Yang, Guohao Dai, Yu Wang
Title: RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation
Abstract:
Reinforcement learning (RL) has demonstrated immense potential in advancing artificial general intelligence, agentic intelligence, and embodied intelligence. However, the inherent heterogeneity and dynamicity of RL workflows often lead to low hardware utilization and slow training on existing systems. In this paper, we present RLinf, a high-performance RL training system based on our key observation that the major roadblock to efficient RL training lies in system flexibility. To maximize flexibility and efficiency, RLinf is built atop a novel RL system design paradigm called macro-to-micro flow transformation (M2Flow), which automatically breaks down high-level, easy-to-compose RL workflows at both the temporal and spatial dimensions, and recomposes them into optimized execution flows. Supported by RLinf worker's adaptive communication capability, we devise context switching and elastic pipelining to realize M2Flow transformation, and a profiling-guided scheduling policy to generate optimal execution plans. Extensive evaluations on both reasoning RL and embodied RL tasks demonstrate that RLinf consistently outperforms state-of-the-art systems, achieving 1.1x-2.13x speedup in end-to-end training throughput.
中文:RLinf通过创新的宏观到微观流程转换设计,构建了灵活的强化学习训练系统,在各项任务中均实现了优于现有系统的性能加速。
English: RLinf introduces a flexible reinforcement learning training system using macro-to-micro flow transformation to optimize workflows, achieving significant speedup over existing systems.

Authors:Zhangqi Jiang, Tingjin Luo, Xu Yang, Xinyan Liang
Title: Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation
Abstract:
View missing remains a significant challenge in graph-based multi-view semi-supervised learning, hindering their real-world applications. To address this issue, traditional methods introduce a missing indicator matrix and focus on mining partial structure among existing samples in each view for label propagation (LP). However, we argue that these disregarded missing samples sometimes induce discontinuous local structures, i.e., sub-clusters, breaking the fundamental smoothness assumption in LP. Consequently, such a Sub-Cluster Problem (SCP) would distort graph fusion and degrade classification performance. To alleviate SCP, we propose a novel incomplete multi-view semi-supervised learning method, termed AGF-TI. Firstly, we design an adversarial graph fusion scheme to learn a robust consensus graph against the distorted local structure through a min-max framework. By stacking all similarity matrices into a tensor, we further recover the incomplete structure from the high-order consistency information based on the low-rank tensor learning. Additionally, the anchor-based strategy is incorporated to reduce the computational complexity. An efficient alternative optimization algorithm combining a reduced gradient descent method is developed to solve the formulated objective, with theoretical convergence. Extensive experimental results on various datasets validate the superiority of our proposed AGF-TI as compared to state-of-the-art methods. Code is available at https://github.com/ZhangqiJiang07/AGF_TI.
中文摘要:提出的AGF-TI方法通过对抗性图融合与张量补全技术相结合,有效解决不完整多视图学习中的子簇问题,从而提升分类性能。
English Summary: The proposed AGF-TI method addresses the sub-cluster problem in incomplete multi-view learning by combining adversarial graph fusion with tensor completion to enhance classification performance.

Authors:Gang Yang, Yue Lei, Wenxin Tai, Jin Wu, Jia Chen, Ting Zhong, Fan Zhou
Title: Compose Yourself: Average-Velocity Flow Matching for One-Step Speech Enhancement
Abstract:
Diffusion and flow matching (FM) models have achieved remarkable progress in speech enhancement (SE), yet their dependence on multi-step generation is computationally expensive and vulnerable to discretization errors. Recent advances in one-step generative modeling, particularly MeanFlow, provide a promising alternative by reformulating dynamics through average velocity fields. In this work, we present COSE, a one-step FM framework tailored for SE. To address the high training overhead of Jacobian-vector product (JVP) computations in MeanFlow, we introduce a velocity composition identity to compute average velocity efficiently, eliminating expensive computation while preserving theoretical consistency and achieving competitive enhancement quality. Extensive experiments on standard benchmarks show that COSE delivers up to 5x faster sampling and reduces training cost by 40%, all without compromising speech quality. Code is available at https://github.com/ICDM-UESTC/COSE.
Chinese: COSE提出了一种用于语音增强的单步流匹配框架,通过速度组合恒等式消除了昂贵的雅可比向量积计算,在保持竞争力的语音质量的同时,实现了5倍加速采样和40%训练成本降低。
English: COSE introduces a one-step flow matching framework for speech enhancement that uses a velocity composition identity to eliminate expensive Jacobian-vector product computations, achieving 5x faster sampling and 40% lower training cost while maintaining competitive quality.

Authors:Katharina Eckstein, Constantin Ulrich, Michael Baumgartner, Jessica Kächele, Dimitrios Bounias, Tassilo Wald, Ralf Floca, Klaus H. Maier-Hein
Title: The Missing Piece: A Case for Pre-Training in 3D Medical Object Detection
Abstract:
Large-scale pre-training holds the promise to advance 3D medical object detection, a crucial component of accurate computer-aided diagnosis. Yet, it remains underexplored compared to segmentation, where pre-training has already demonstrated significant benefits. Existing pre-training approaches for 3D object detection rely on 2D medical data or natural image pre-training, failing to fully leverage 3D volumetric information. In this work, we present the first systematic study of how existing pre-training methods can be integrated into state-of-the-art detection architectures, covering both CNNs and Transformers. Our results show that pre-training consistently improves detection performance across various tasks and datasets. Notably, reconstruction-based self-supervised pre-training outperforms supervised pre-training, while contrastive pre-training provides no clear benefit for 3D medical object detection. Our code is publicly available at: https://github.com/MIC-DKFZ/nnDetection-finetuning.
中文: 大规模预训练显著提升3D医学目标检测性能,其中基于重建的自监督方法效果最佳,而对比式预训练则收效甚微。
English: Large-scale pre-training significantly enhances 3D medical object detection, with reconstruction-based self-supervised methods proving most effective, while contrastive pre-training shows limited benefits.

Authors:Zhengyao Huang, Daniel Zhengyu Huang, Tiannan Xiao, Dina Ma, Zhenyu Ming, Hao Shi, Yuanhui Wen
Title: Improving Monte Carlo Tree Search for Symbolic Regression
Abstract:
Symbolic regression aims to discover concise, interpretable mathematical expressions that satisfy desired objectives, such as fitting data, posing a highly combinatorial optimization problem. While genetic programming has been the dominant approach, recent efforts have explored reinforcement learning methods for improving search efficiency. Monte Carlo Tree Search (MCTS), with its ability to balance exploration and exploitation through guided search, has emerged as a promising technique for symbolic expression discovery. However, its traditional bandit strategies and sequential symbol construction often limit performance. In this work, we propose an improved MCTS framework for symbolic regression that addresses these limitations through two key innovations: (1) an extreme bandit allocation strategy tailored for identifying globally optimal expressions, with finite-time performance guarantees under polynomial reward decay assumptions; and (2) evolution-inspired state-jumping actions such as mutation and crossover, which enable non-local transitions to promising regions of the search space. These state-jumping actions also reshape the reward landscape during the search process, improving both robustness and efficiency. We conduct a thorough numerical study to the impact of these improvements and benchmark our approach against existing symbolic regression methods on a variety of datasets, including both ground-truth and black-box datasets. Our approach achieves competitive performance with state-of-the-art libraries in terms of recovery rate, attains favorable positions on the Pareto frontier of accuracy versus model complexity. Code is available at https://github.com/PKU-CMEGroup/MCTS-4-SR.
中文摘要:本研究提出了一种改进的蒙特卡洛树搜索框架,通过极值赌博机分配策略和演化启发的状态跳跃动作,有效提升了符号回归的搜索效率与鲁棒性,在多个数据集上展现出与先进方法相竞争的性能。
English Summary: This paper introduces an enhanced Monte Carlo Tree Search framework for symbolic regression that incorporates an extreme bandit strategy with performance guarantees and evolution-inspired state-jumping actions to improve search efficiency and robustness.

Authors:Johannes Köhler, Daniel Zhang, Raffaele Soloperto, Andrea Carron, Melanie Zeilinger
Title: An MPC framework for efficient navigation of mobile robots in cluttered environments
Abstract:
We present a model predictive control (MPC) framework for efficient navigation of mobile robots in cluttered environments. The proposed approach integrates a finite-segment shortest path planner into the finite-horizon trajectory optimization of the MPC. This formulation ensures convergence to dynamically selected targets and guarantees collision avoidance, even under general nonlinear dynamics and cluttered environments. The approach is validated through hardware experiments on a small ground robot, where a human operator dynamically assigns target locations. The robot successfully navigated through complex environments and reached new targets within 2-3 seconds.
中文: 本研究提出了一种模型预测控制框架,将路径规划与轨迹优化相结合,使移动机器人能够在复杂环境中安全高效导航,硬件实验证明其可在数秒内实现目标抵达。
English: This study introduces a model predictive control framework that combines path planning with trajectory optimization to enable mobile robots to navigate cluttered environments safely and efficiently, achieving target convergence within seconds as demonstrated in hardware experiments.

Authors:David Calhas, Arlindo L. Oliveira
Title: Deep Feedback Models
Abstract:
Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at https://github.com/DCalhas/deep_feedback_models.
中文: 深度反馈模型通过引入反馈机制实现内部状态的迭代优化,在噪声鲁棒性和小样本泛化能力上显著优于前馈网络,尤其在物体识别和医学影像任务中表现突出。
English: Deep Feedback Models (DFMs) enhance neural networks by integrating feedback mechanisms for iterative state refinement, demonstrating superior performance in noise resilience and data efficiency for tasks like object recognition and medical imaging.

Authors:Jiahao Li, Xinhong Chen, Zhengmin Jiang, Qian Zhou, Yung-Hui Li, Jianping Wang
Title: Global Regulation and Excitation via Attention Tuning for Stereo Matching
Abstract:
Stereo matching achieves significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and achieves second on the Middlebury benchmark. Code is available at https://github.com/JarvisLee0423/GREAT-Stereo.
中文摘要:GREAT框架通过整合三种注意力模块,为迭代式立体匹配方法引入全局上下文和几何信息,显著提升了在复杂区域的匹配性能,并在多个基准测试中取得领先成绩。
English Summary: The GREAT framework enhances iterative stereo-matching methods by integrating three attention modules to capture global context and geometric information, significantly improving performance in challenging regions and achieving top results on multiple benchmarks.

Authors:Liwei Liao, Xufeng Li, Xiaoyun Zheng, Boning Liu, Feng Gao, Ronggang Wang
Title: Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
Abstract:
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods encounter two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require larges amounts of labeled data for effective training. To this end, we propose \underline{G}rounding via \underline{V}iew \underline{R}etrieval (GVR), a novel zero-shot visual grounding framework for 3DGS to transform 3DVG as a 2D retrieval task that leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation, but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found in https://github.com/leviome/GVR_demos.
中文: 提出的GVR框架通过将3D物体定位转化为2D视图检索任务,为3D高斯溅射实现了零样本视觉定位,不仅无需逐场景训练和大量标注数据,还取得了最先进的性能表现。
English: The proposed GVR framework introduces a zero-shot visual grounding method for 3D Gaussian Splatting by transforming 3D object localization into a 2D view retrieval task, eliminating the need for per-scene training and extensive annotations while achieving state-of-the-art performance.

Authors:Alina Kostromina, Kseniia Kuvshinova, Aleksandr Yugay, Andrey Savchenko, Dmitry Simakov
Title: Tsururu: A Python-based Time Series Forecasting Strategies Library
Abstract:
While current time series research focuses on developing new models, crucial questions of selecting an optimal approach for training such models are underexplored. Tsururu, a Python library introduced in this paper, bridges SoTA research and industry by enabling flexible combinations of global and multivariate approaches and multi-step-ahead forecasting strategies. It also enables seamless integration with various forecasting models. Available at https://github.com/sb-ai-lab/tsururu .
中文: 本文介绍了Tsururu这一Python库,它通过灵活结合全局与多元方法及多步预测策略,并实现与多种预测模型的无缝集成,弥合了前沿研究与工业应用之间的鸿沟。
English: This paper introduces Tsururu, a Python library that bridges the gap between state-of-the-art research and industry by enabling flexible combinations of global and multivariate approaches with multi-step-ahead forecasting strategies, while also allowing seamless integration with various forecasting models.

Authors:Zhongze Luo, Zhenshuai Yin, Yongxin Guo, Zhichao Wang, Jionghao Zhu, Xiaoying Tang
Title: Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems
Abstract:
While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce \textbf {Multi-Physics} for Chinese physics reasoning, a comprehensive benchmark that includes 5 difficulty levels, featuring 1,412 image-associated, multiple-choice questions spanning 11 high-school physics subjects. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing the model performance before and after changing the input mode. Our work provides not only a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs, and our dataset and code have been open-sourced: https://github.com/luozhongze/Multi-Physics.
中文摘要:本研究针对多模态大模型在物理推理评估中的不足,推出了Multi-Physics中文基准,通过多难度题目和双评估框架系统分析模型答案准确性与思维链完整性。
English Summary: This study introduces the Multi-Physics benchmark to address gaps in evaluating multimodal LLMs for Chinese physics reasoning, featuring multi-level questions and a dual evaluation framework that assesses both answer accuracy and reasoning integrity.

Authors:Xueping Zhang, Liwei Jin, Yechen Wang, Linxi Li, Ming Li
Title: CompSpoof: A Dataset and Joint Learning Framework for Component-Level Audio Anti-spoofing Countermeasures
Abstract:
Component-level audio Spoofing (Comp-Spoof) targets a new form of audio manipulation where only specific components of a signal, such as speech or environmental sound, are forged or substituted while other components remain genuine. Existing anti-spoofing datasets and methods treat an utterance or a segment as entirely bona fide or entirely spoofed, and thus cannot accurately detect component-level spoofing. To address this, we construct a new dataset, CompSpoof, covering multiple combinations of bona fide and spoofed speech and environmental sound. We further propose a separation-enhanced joint learning framework that separates audio components apart and applies anti-spoofing models to each one. Joint learning is employed, preserving information relevant for detection. Extensive experiments demonstrate that our method outperforms the baseline, highlighting the necessity of separate components and the importance of detecting spoofing for each component separately. Datasets and code are available at: https://github.com/XuepingZhang/CompSpoof.
Chinese: 组件级音频伪造针对信号中特定部分(如语音或环境声音)进行篡改,而其他部分保持真实,作者构建了新数据集并提出分离增强的联合学习框架,通过分别检测各组件来有效识别此类伪造。
English: Component-level audio spoofing involves forging specific parts of an audio signal, such as speech or environmental sounds, while keeping other components authentic, and the authors introduce a new dataset and a separation-enhanced joint learning framework to effectively detect such spoofing by analyzing each component separately.

Authors:Haotian Zhang, Han Guo, Keyan Chen, Hao Chen, Zhengxia Zou, Zhenwei Shi
Title: FoBa: A Foreground-Background co-Guided Method and New Benchmark for Remote Sensing Semantic Change Detection
Abstract:
Despite the remarkable progress achieved in remote sensing semantic change detection (SCD), two major challenges remain. At the data level, existing SCD datasets suffer from limited change categories, insufficient change types, and a lack of fine-grained class definitions, making them inadequate to fully support practical applications. At the methodological level, most current approaches underutilize change information, typically treating it as a post-processing step to enhance spatial consistency, which constrains further improvements in model performance. To address these issues, we construct a new benchmark for remote sensing SCD, LevirSCD. Focused on the Beijing area, the dataset covers 16 change categories and 210 specific change types, with more fine-grained class definitions (e.g., roads are divided into unpaved and paved roads). Furthermore, we propose a foreground-background co-guided SCD (FoBa) method, which leverages foregrounds that focus on regions of interest and backgrounds enriched with contextual information to guide the model collaboratively, thereby alleviating semantic ambiguity while enhancing its ability to detect subtle changes. Considering the requirements of bi-temporal interaction and spatial consistency in SCD, we introduce a Gated Interaction Fusion (GIF) module along with a simple consistency loss to further enhance the model's detection performance. Extensive experiments on three datasets (SECOND, JL1, and the proposed LevirSCD) demonstrate that FoBa achieves competitive results compared to current SOTA methods, with improvements of 1.48%, 3.61%, and 2.81% in the SeK metric, respectively. Our code and dataset are available at https://github.com/zmoka-zht/FoBa.
Chinese: 该研究提出了LevirSCD遥感语义变化检测数据集,包含细粒度变化类别,并开发了FoBa方法,通过前景-背景协同引导和门控交互融合模块,在三个数据集上相比现有方法取得了显著性能提升。
English: The study introduces LevirSCD, a new remote sensing semantic change detection dataset with fine-grained categories, and proposes the FoBa method that uses foreground-background co-guidance and a Gated Interaction Fusion module to significantly improve detection performance over existing approaches.

Authors:Chang Soo Lim, Joonyoung Moon, Donghyeon Cho
Title: Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution
Abstract:
Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at https://github.com/2025-LSVOS-3rd-place/MOSEv2_3rd_place.
Chinese: 提出的SCOPE框架通过将SAM2的ViT编码器和运动预测模块整合到Cutie中,增强了视频对象分割能力,并采用集成策略在LSVOS挑战赛中荣获第三名。
English: The proposed SCOPE framework enhances video object segmentation by integrating SAM2's enriched ViT encoder and a motion prediction module into Cutie, achieving third place in the LSVOS Challenge through an ensemble strategy.

Authors:Yang Li, Tingfa Xu, Shuyan Bai, Peifu Liu, Jianan Li
Title: MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection
Abstract:
Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at https://github.com/yl2900260-bit/MCOD.
中文: 针对基于RGB的伪装目标检测的局限性,MCOD基准数据集应运而生,它采用多光谱图像,包含多样化挑战和高质量标注,通过整合光谱信息显著提升了检测的鲁棒性。
English: To address the limitations of RGB-based camouflaged object detection, the MCOD benchmark dataset is introduced, featuring multispectral imagery with diverse challenges and high-quality annotations, which significantly improves detection robustness through spectral information integration.

Authors:Yongsheng Feng, Yuetonghui Xu, Jiehui Luo, Hongjia Liu, Xiaobing Li, Feng Yu, Wei Li
Title: TISDiSS: A Training-Time and Inference-Time Scalable Framework for Discriminative Source Separation
Abstract:
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation. Code is available at https://github.com/WingSingFung/TISDiSS.
中文摘要:TISDiSS是一种可扩展的源分离框架,通过动态推理重复实现灵活的速率-性能权衡,以更少的参数取得了最先进的性能。
English Summary: TISDiSS is a scalable source separation framework that enables flexible speed-performance trade-offs through dynamic inference repetitions, achieving state-of-the-art results with fewer parameters.

Authors:Fangyuan Mao, Shuo Wang, Jilin Mei, Chen Min, Shun Lu, Fuyang Liu, Yu Hu
Title: UNIV: Unified Foundation Model for Infrared and Visible Modalities
Abstract:
The demand for joint RGB-visible and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for RGB-visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing - combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible RGB tasks. Our code is available at https://github.com/fangyuanmao/UNIV.
中文: UNIV基础模型通过仿生创新解决了RGB可见光与红外感知的多模态性能差距:采用注意力引导的跨模态对比学习实现特征对齐,结合双知识保留机制防止灾难性遗忘,在提升红外任务性能的同时保持了RGB任务的基准表现。
English: The UNIV foundation model addresses multimodal performance gaps in RGB-visible and infrared perception through biologically inspired innovations—Patch-wise Cross-modality Contrastive Learning for feature alignment and a dual-knowledge preservation mechanism to prevent catastrophic forgetting—demonstrating superior infrared task performance while maintaining RGB capability.

Authors:Pan Tang, Shixiang Tang, Huanqi Pu, Zhiqing Miao, Zhixing Wang
Title: MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents
Abstract:
This paper presents MicroRCA-Agent, an innovative solution for microservice root cause analysis based on large language model agents, which constructs an intelligent fault root cause localization system with multimodal data fusion. The technical innovations are embodied in three key aspects: First, we combine the pre-trained Drain log parsing algorithm with multi-level data filtering mechanism to efficiently compress massive logs into high-quality fault features. Second, we employ a dual anomaly detection approach that integrates Isolation Forest unsupervised learning algorithms with status code validation to achieve comprehensive trace anomaly identification. Third, we design a statistical symmetry ratio filtering mechanism coupled with a two-stage LLM analysis strategy to enable full-stack phenomenon summarization across node-service-pod hierarchies. The multimodal root cause analysis module leverages carefully designed cross-modal prompts to deeply integrate multimodal anomaly information, fully exploiting the cross-modal understanding and logical reasoning capabilities of large language models to generate structured analysis results encompassing fault components, root cause descriptions, and reasoning trace. Comprehensive ablation studies validate the complementary value of each modal data and the effectiveness of the system architecture. The proposed solution demonstrates superior performance in complex microservice fault scenarios, achieving a final score of 50.71. The code has been released at: https://github.com/tangpan360/MicroRCA-Agent.
中文摘要:MicroRCA-Agent是一种基于大语言模型的智能故障根因定位系统,通过多模态数据融合、日志压缩和双重异常检测机制,在复杂微服务场景中实现精准的故障分析与诊断。
English Summary: MicroRCA-Agent is an intelligent fault localization system that leverages large language models and multimodal data fusion to achieve comprehensive root cause analysis in microservices through log compression, dual anomaly detection, and cross-modal reasoning.

Authors:Zheng Wang, Hong Liu, Zheng Wang, Danyi Li, Min Cen, Baptiste Magnier, Li Liang, Liansheng Wang
Title: Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation
Abstract:
Survival analysis based on Whole Slide Images (WSIs) is crucial for evaluating cancer prognosis, as they offer detailed microscopic information essential for predicting patient outcomes. However, traditional WSI-based survival analysis usually faces noisy features and limited data accessibility, hindering their ability to capture critical prognostic features effectively. Although pathology reports provide rich patient-specific information that could assist analysis, their potential to enhance WSI-based survival analysis remains largely unexplored. To this end, this paper proposes a novel Report-auxiliary self-distillation (Rasa) framework for WSI-based survival analysis. First, advanced large language models (LLMs) are utilized to extract fine-grained, WSI-relevant textual descriptions from original noisy pathology reports via a carefully designed task prompt. Next, a self-distillation-based pipeline is designed to filter out irrelevant or redundant WSI features for the student model under the guidance of the teacher model's textual knowledge. Finally, a risk-aware mix-up strategy is incorporated during the training of the student model to enhance both the quantity and diversity of the training data. Extensive experiments carried out on our collected data (CRC) and public data (TCGA-BRCA) demonstrate the superior effectiveness of Rasa against state-of-the-art methods. Our code is available at https://github.com/zhengwang9/Rasa.
中文: 本文提出Rasa框架,通过大型语言模型从病理报告中提取相关文本特征,并采用自蒸馏和风险感知混合策略优化全切片图像特征筛选与数据多样性,在CRC和TCGA-BRCA数据集上实现了卓越的癌症生存分析性能。
English: This paper introduces the Rasa framework, which enhances WSI-based cancer survival analysis by leveraging LLMs to extract relevant textual features from pathology reports and employing self-distillation with a risk-aware mix-up strategy to improve feature selection and data diversity, achieving superior performance on CRC and TCGA-BRCA datasets.

Authors:Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi, Woon-Seng Gan
Title: MAGENTA: Magnitude and Geometry-ENhanced Training Approach for Robust Long-Tailed Sound Event Localization and Detection
Abstract:
Deep learning-based Sound Event Localization and Detection (SELD) systems degrade significantly on real-world, long-tailed datasets. Standard regression losses bias learning toward frequent classes, causing rare events to be systematically under-recognized. To address this challenge, we introduce MAGENTA (Magnitude And Geometry-ENhanced Training Approach), a unified loss function that counteracts this bias within a physically interpretable vector space. MAGENTA geometrically decomposes the regression error into radial and angular components, enabling targeted, rarity-aware penalties and strengthened directional modeling. Empirically, MAGENTA substantially improves SELD performance on imbalanced real-world data, providing a principled foundation for a new class of geometry-aware SELD objectives. Code is available at: https://github.com/itsjunwei/MAGENTA_ICASSP
中文摘要:MAGENTA框架通过将回归误差分解为径向和角度分量,提出统一的损失函数来解决基于深度学习的声学事件定位与检测系统在非平衡数据集上的性能下降问题,实现针对稀有事件的优化。
English Summary: The MAGENTA framework addresses performance degradation in deep learning-based Sound Event Localization and Detection systems on imbalanced datasets by introducing a unified loss function that decomposes regression errors into radial and angular components for rarity-aware optimization.

Authors:Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin
Title: Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification
Abstract:
Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
中文: 潜在分区网络(LZN)通过构建共享高斯潜空间和任务专用编解码器,统一了生成建模、表征学习和分类三大机器学习核心任务,在保持训练目标不变的前提下实现了多项性能提升。
English: The Latent Zoning Network (LZN) unifies generative modeling, representation learning, and classification by creating a shared Gaussian latent space with task-specific encoders and decoders, demonstrating improved performance across diverse machine learning tasks without modifying core training objectives.

Authors:Runxin Zhao, Chunxiang Wang, Hanyang Zhuang, Ming Yang
Title: Bench-RNR: Dataset for Benchmarking Repetitive and Non-repetitive Scanning LiDAR for Infrastructure-based Vehicle Localization
Abstract:
Vehicle localization using roadside LiDARs can provide centimeter-level accuracy for cloud-controlled vehicles while simultaneously serving multiple vehicles, enhanc-ing safety and efficiency. While most existing studies rely on repetitive scanning LiDARs, non-repetitive scanning LiDAR offers advantages such as eliminating blind zones and being more cost-effective. However, its application in roadside perception and localization remains limited. To address this, we present a dataset for infrastructure-based vehicle localization, with data collected from both repetitive and non-repetitive scanning LiDARs, in order to benchmark the performance of different LiDAR scanning patterns. The dataset contains 5,445 frames of point clouds across eight vehicle trajectory sequences, with diverse trajectory types. Our experiments establish base-lines for infrastructure-based vehicle localization and compare the performance of these methods using both non-repetitive and repetitive scanning LiDARs. This work offers valuable insights for selecting the most suitable LiDAR scanning pattern for infrastruc-ture-based vehicle localization. Our dataset is a signifi-cant contribution to the scientific community, supporting advancements in infrastructure-based perception and vehicle localization. The dataset and source code are publicly available at: https://github.com/sjtu-cyberc3/BenchRNR.
中文: 本研究提出了一个基于路侧设施的车辆定位数据集,通过对比重复和非重复扫描激光雷达的性能,为选择最佳扫描模式提供依据。
English: This study introduces a dataset for infrastructure-based vehicle localization, comparing repetitive and non-repetitive scanning LiDARs to evaluate their performance and provide insights for optimal scanning pattern selection.

Authors:Shilong Bao, Qianqian Xu, Feiran Li, Boyu Han, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Title: Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach
Abstract:
This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios when multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. Finally, comprehensive experiments speak to the efficacy of our proposed approach. The code is available at https://github.com/Ferry-Li/SI-SOD.
本文针对显著目标检测中评估指标对尺寸的敏感性问题,提出了一个尺寸不变性评估框架,通过独立评估各组件来消除尺寸偏差,确保不同大小目标的公平检测。
This paper identifies and addresses the size bias in Salient Object Detection metrics by proposing a Size-Invariant Evaluation framework that ensures balanced assessment across objects of varying sizes.

Authors:Shilong Bao, Qianqian Xu, Feiran Li, Boyu Han, Zhiyong Yang, Xiaochun Cao, Qingming Huang
Title: Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach
Abstract:
This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios when multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. Finally, comprehensive experiments speak to the efficacy of our proposed approach. The code is available at https://github.com/Ferry-Li/SI-SOD.
本文针对显著目标检测中评估指标对尺寸的敏感性问题,提出了一个尺寸不变性评估框架,通过独立评估各组件来消除尺寸偏差,确保不同大小目标的公平检测。
This paper identifies and addresses the size bias in Salient Object Detection metrics by proposing a Size-Invariant Evaluation framework that ensures balanced assessment across objects of varying sizes.

Authors:Tian Lan, Yiming Zheng, Jianxin Yin
Title: Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification
Abstract:
Multi-label classification has broad applications and depends on powerful representations capable of capturing multi-label interactions. We introduce \textit{Diff-Feat}, a simple but powerful framework that extracts intermediate features from pre-trained diffusion-Transformer models for images and text, and fuses them for downstream tasks. We observe that for vision tasks, the most discriminative intermediate feature along the diffusion process occurs at the middle step and is located in the middle block in Transformer. In contrast, for language tasks, the best feature occurs at the noise-free step and is located in the deepest block. In particular, we observe a striking phenomenon across varying datasets: a mysterious "Layer $12$" consistently yields the best performance on various downstream classification tasks for images (under DiT-XL/2-256$\times$256). We devise a heuristic local-search algorithm that pinpoints the locally optimal "image-text"$\times$"block-timestep" pair among a few candidates, avoiding an exhaustive grid search. A simple fusion-linear projection followed by addition-of the selected representations yields state-of-the-art performance: 98.6\% mAP on MS-COCO-enhanced and 45.7\% mAP on Visual Genome 500, surpassing strong CNN, graph, and Transformer baselines by a wide margin. t-SNE and clustering metrics further reveal that \textit{Diff-Feat} forms tighter semantic clusters than unimodal counterparts. The code is available at https://github.com/lt-0123/Diff-Feat.
Chinese: Diff-Feat框架通过提取预训练扩散Transformer模型的图像和文本中间特征并进行融合,采用启发式搜索寻找最优特征对,在多标签分类任务中实现了最先进的性能。
English: The Diff-Feat framework extracts and fuses intermediate features from pre-trained diffusion-Transformer models for images and text, achieving state-of-the-art multi-label classification performance through a heuristic search for optimal feature pairs.

Authors:Xiaowei Zhu, Yubing Ren, Fang Fang, Qingfeng Tan, Shi Wang, Yanan Cao
Title: DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm
Abstract:
The rapid advancement of large language models (LLMs) has blurred the line between AI-generated and human-written text. This progress brings societal risks such as misinformation, authorship ambiguity, and intellectual property concerns, highlighting the urgent need for reliable AI-generated text detection methods. However, recent advances in generative language modeling have resulted in significant overlap between the feature distributions of human-written and AI-generated text, blurring classification boundaries and making accurate detection increasingly challenging. To address the above challenges, we propose a DNA-inspired perspective, leveraging a repair-based process to directly and interpretably capture the intrinsic differences between human-written and AI-generated text. Building on this perspective, we introduce DNA-DetectLLM, a zero-shot detection method for distinguishing AI-generated and human-written text. The method constructs an ideal AI-generated sequence for each input, iteratively repairs non-optimal tokens, and quantifies the cumulative repair effort as an interpretable detection signal. Empirical evaluations demonstrate that our method achieves state-of-the-art detection performance and exhibits strong robustness against various adversarial attacks and input lengths. Specifically, DNA-DetectLLM achieves relative improvements of 5.55% in AUROC and 2.08% in F1 score across multiple public benchmark datasets. Code and data are available at https://github.com/Xiaoweizhu57/DNA-DetectLLM.
大型语言模型的快速发展使得区分AI生成文本与人类写作愈发困难,为此我们提出了DNA-DetectLLM,一种零样本检测方法,通过修复机制实现了最先进的检测精度和鲁棒性。
The rapid progress of large language models has made distinguishing AI-generated text from human writing increasingly difficult, prompting the development of DNA-DetectLLM, a novel zero-shot detection method that uses a repair-based approach to achieve state-of-the-art accuracy and robustness.

Authors:Wei Chen, Tongguan Wang, Feiyue Xue, Junkai Li, Hui Liu, Ying Sha
Title: Beyond Words: Enhancing Desire, Emotion, and Sentiment Recognition with Non-Verbal Cues
Abstract:
Desire, as an intention that drives human behavior, is closely related to both emotion and sentiment. Multimodal learning has advanced sentiment and emotion recognition, but multimodal approaches specially targeting human desire understanding remain underexplored. And existing methods in sentiment analysis predominantly emphasize verbal cues and overlook images as complementary non-verbal cues. To address these gaps, we propose a Symmetrical Bidirectional Multimodal Learning Framework for Desire, Emotion, and Sentiment Recognition, which enforces mutual guidance between text and image modalities to effectively capture intention-related representations in the image. Specifically, low-resolution images are used to obtain global visual representations for cross-modal alignment, while high resolution images are partitioned into sub-images and modeled with masked image modeling to enhance the ability to capture fine-grained local features. A text-guided image decoder and an image-guided text decoder are introduced to facilitate deep cross-modal interaction at both local and global representations of image information. Additionally, to balance perceptual gains with computation cost, a mixed-scale image strategy is adopted, where high-resolution images are cropped into sub-images for masked modeling. The proposed approach is evaluated on MSED, a multimodal dataset that includes a desire understanding benchmark, as well as emotion and sentiment recognition. Experimental results indicate consistent improvements over other state-of-the-art methods, validating the effectiveness of our proposed method. Specifically, our method outperforms existing approaches, achieving F1-score improvements of 1.1% in desire understanding, 0.6% in emotion recognition, and 0.9% in sentiment analysis. Our code is available at: https://github.com/especiallyW/SyDES.
中文: 本文提出了一种对称双向多模态学习框架,通过文本与图像模态的相互引导来增强欲望、情感和情绪识别能力,在三个任务上均实现了性能提升并超越了现有最佳方法。
English: This paper introduces a Symmetrical Bidirectional Multimodal Learning Framework that enhances desire, emotion, and sentiment recognition by enabling mutual guidance between text and image modalities, achieving state-of-the-art performance improvements across all three tasks.

Authors:Abdarahmane Traore, Éric Hervet, Andy Couturier
Title: SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters
Abstract:
Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes their deployment challenging in resource-constrained environments such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively align visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives. These findings highlight the potential for efficient, deployable multimodal intelligence in real-world settings without sacrificing core spatial reasoning capabilities. The code of the experimentation will be available at: https://github.com/abtraore/SmolRGPT
中文: SmolRGPT是一种紧凑的视觉语言模型,通过融合RGB和深度信息实现高效空间推理,仅用6亿参数即可获得优异性能,适用于资源受限的实际应用场景。
English: SmolRGPT is a compact vision-language model that integrates RGB and depth cues for efficient spatial reasoning, achieving competitive performance with only 600M parameters while enabling deployment in resource-constrained environments.

Authors:Lioz Berman, Sharon Gannot, Tom Tirer
Title: (SP)$^2$-Net: A Neural Spatial Spectrum Method for DOA Estimation
Abstract:
We consider the problem of estimating the directions of arrival (DOAs) of multiple sources from a single snapshot of an antenna array, a task with many practical applications. In such settings, the classical Bartlett beamformer is commonly used, as maximum likelihood estimation becomes impractical when the number of sources is unknown or large, and spectral methods based on the sample covariance are not applicable due to the lack of multiple snapshots. However, the accuracy and resolution of the Bartlett beamformer are fundamentally limited by the array aperture. In this paper, we propose a deep learning technique, comprising a novel architecture and training strategy, for generating a high-resolution spatial spectrum from a single snapshot. Specifically, we train a deep neural network that takes the measurements and a hypothesis angle as input and learns to output a score consistent with the capabilities of a much wider array. At inference time, a heatmap can be produced by scanning an arbitrary set of angles. We demonstrate the advantages of our trained model, named (SP)$^2$-Net, over the Bartlett beamformer and sparsity-based DOA estimation methods.
本文提出了一种名为(SP)$^2$-Net的深度学习方法,通过模拟更宽阵列的能力,从单次快照中提高了到达方向估计的精度,优于传统的Bartlett波束形成器等方法。
This paper introduces a deep learning approach, (SP)$^2$-Net, that enhances direction of arrival estimation accuracy from a single snapshot by simulating a wider array's capabilities, outperforming traditional methods like the Bartlett beamformer.

Authors:Kevin Ren, Santiago Cortes-Gomez, Carlos Miguel Patiño, Ananya Joshi, Ruiqi Lyu, Jingjing Tang, Alistair Turcan, Khurram Yamin, Steven Wu, Bryan Wilder
Title: Predicting Language Models' Success at Zero-Shot Probabilistic Prediction
Abstract:
Recent work has investigated the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics (e.g., to serve as risk models or augment survey datasets). However, when should a user have confidence that an LLM will provide high-quality predictions for their particular task? To address this question, we conduct a large-scale empirical study of LLMs' zero-shot predictive capabilities across a wide range of tabular prediction tasks. We find that LLMs' performance is highly variable, both on tasks within the same dataset and across different datasets. However, when the LLM performs well on the base prediction task, its predicted probabilities become a stronger signal for individual-level accuracy. Then, we construct metrics to predict LLMs' performance at the task level, aiming to distinguish between tasks where LLMs may perform well and where they are likely unsuitable. We find that some of these metrics, each of which are assessed without labeled data, yield strong signals of LLMs' predictive performance on new tasks.
中文: 近期研究探讨大型语言模型作为零样本预测器在个体特征预测中的应用,发现其性能在不同任务间差异显著,但当基础预测准确时预测质量提升,新构建的指标可有效识别适用场景。
English: Recent research explores large language models as zero-shot predictors for individual-level characteristics, finding their performance varies widely across tasks but improves when base predictions are accurate, with new metrics helping identify suitable applications.

Authors:Yulin Wang, Yang Yue, Yang Yue, Huanqian Wang, Haojun Jiang, Yizeng Han, Zanlin Ni, Yifan Pu, Minglei Shi, Rui Lu, Qisen Yang, Andrew Zhao, Zhuofan Xia, Shiji Song, Gao Huang
Title: Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception
Abstract:
Human vision is highly adaptive, efficiently sampling intricate environments by sequentially fixating on task-relevant regions. In contrast, prevailing machine vision models passively process entire scenes at once, resulting in excessive resource demands scaling with spatial-temporal input resolution and model size, yielding critical limitations impeding both future advancements and real-world application. Here we introduce AdaptiveNN, a general framework aiming to drive a paradigm shift from 'passive' to 'active, adaptive' vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process, progressively identifying and attending to regions pertinent to the task, incrementally combining information across fixations, and actively concluding observation when sufficient. We establish a theory integrating representation learning with self-rewarding reinforcement learning, enabling end-to-end training of the non-differentiable AdaptiveNN without additional supervision on fixation locations. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, processing images from real driving and medical scenarios, language-driven embodied AI, and side-by-side comparisons with humans. AdaptiveNN achieves up to 28x inference cost reduction without sacrificing accuracy, flexibly adapts to varying task demands and resource budgets without retraining, and provides enhanced interpretability via its fixation patterns, demonstrating a promising avenue toward efficient, flexible, and interpretable computer vision. Furthermore, AdaptiveNN exhibits closely human-like perceptual behaviors in many cases, revealing its potential as a valuable tool for investigating visual cognition. Code is available at https://github.com/LeapLabTHU/AdaptiveNN.
中文摘要:AdaptiveNN提出了一种主动视觉框架,通过模拟人眼注视机制实现从粗到精的序列化视觉处理,在保持精度的同时大幅降低计算成本,并在多任务中展现出类人的感知特性与良好可解释性。
English Summary: AdaptiveNN introduces an active vision framework that mimics human eye movements to process visual information sequentially, significantly reducing computational costs while maintaining accuracy and enhancing interpretability across diverse tasks.

Authors:Emilie Kibsgaard, Anita Sue Jwa, Christopher J Markiewicz, David Rodriguez Gonzalez, Judith Sainz Pardo, Russell A. Poldrack, Cyril R. Pernet
Title: Assessing metadata privacy in neuroimaging
Abstract:
The ethical and legal imperative to share research data without causing harm requires careful attention to privacy risks. While mounting evidence demonstrates that data sharing benefits science, legitimate concerns persist regarding the potential leakage of personal information that could lead to reidentification and subsequent harm. We reviewed metadata accompanying neuroimaging datasets from six heterogeneous studies openly available on OpenNeuro, involving participants across the lifespan, from children to older adults, with and without clinical diagnoses, and including associated clinical score data. Using metaprivBIDS (https://github.com/CPernet/metaprivBIDS), a novel tool for the systematic assessment of privacy in tabular data, we found that privacy is generally well maintained, with serious vulnerabilities being rare. Nonetheless, minor issues were identified in nearly all datasets and warrant mitigation. Notably, clinical score data (e.g., neuropsychological results) posed minimal reidentification risk, whereas demographic variables (age, sex, race, income, and geolocation) represented the principal privacy vulnerabilities. We outline practical measures to address these risks, enabling safer data sharing practices.
中文: 研究数据共享需在科学获益与隐私保护间取得平衡,神经影像研究表明尽管整体防护良好,但人口统计学变量仍是重识别风险的主要来源。
English: Data sharing in research requires balancing scientific benefits with privacy protection, as neuroimaging studies show demographic variables pose the main reidentification risks despite overall secure practices.

Authors:Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee
Title: Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages
Abstract:
The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce \textsf{SGToxicGuard}, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: \textit{conversation}, \textit{question-answering}, and \textit{content composition}. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments.\footnote{Link to the dataset: https://github.com/Social-AI-Studio/SGToxicGuard.} \textcolor{red}{Disclaimer: This paper contains sensitive content that may be disturbing to some readers.}
中文:本研究提出了SGToxicGuard数据集和评估框架,用于测试大型语言模型在新加坡多语言环境中的安全性,揭示了关键漏洞并为构建更安全的AI系统提供了实践指导。
English: This study introduces SGToxicGuard, a dataset and framework for evaluating the safety of large language models in Singapore's multilingual context, revealing critical vulnerabilities and providing insights for safer AI systems.

Authors:Wenda Qin, Andrea Burns, Bryan A. Plummer, Margrit Betke
Title: Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning
Abstract:
Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks, but are costly to run in resource-limited environments. Token pruning offers appealing tradeoffs for efficiency with minimal performance loss by reducing model input size, but prior work overlooks VLN-specific challenges. For example, information loss from pruning can effectively increase computational cost due to longer walks. Thus, the inability to identify uninformative tokens undermines the supposed efficiency gains from pruning. To address this, we propose Navigation-Aware Pruning (NAP), which uses navigation-specific traits to simplify the pruning process by pre-filtering tokens into foreground and background. For example, image views are filtered based on whether the agent can navigate in that direction. We also extract navigation-relevant instructions using a Large Language Model. After filtering, we focus pruning on background tokens, minimizing information loss. To further help avoid increases in navigation length, we discourage backtracking by removing low-importance navigation nodes. Experiments on standard VLN benchmarks show NAP significantly outperforms prior work, preserving higher success rates while saving more than 50% FLOPS.
中文摘要:本研究提出的导航感知剪枝(NAP)方法通过基于导航特性筛选背景标记进行定向剪枝,在保持高任务成功率的同时实现超过50%的计算效率提升。
English Summary: The proposed Navigation-Aware Pruning (NAP) method enhances vision-and-language navigation efficiency by selectively pruning background tokens using navigation-specific criteria, achieving over 50% computational savings while maintaining high task success rates.

Authors:Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitain Shi, Jiale Wei, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen
Title: MICA: Multi-Agent Industrial Coordination Assistant
Abstract:
Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.
中文:MICA是一种面向工业辅助的语音交互多智能体系统,通过自适应推理与安全审核机制,在保障隐私和硬件限制的前提下,显著提升了任务执行的成功率与可靠性。
English: MICA is a speech-interactive multi-agent system designed for industrial assistance, integrating adaptive reasoning and safety checks to enhance task success and reliability while operating under privacy and hardware constraints.

Authors:Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Title: ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Abstract:
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
Chinese: 视觉感知推测解码(ViSpec)是一种新颖框架,通过轻量级视觉适配器压缩图像标记并融合全局特征,首次实现了视觉语言模型推测解码的显著加速。
English: Vision-Aware Speculative Decoding (ViSpec) is a novel framework that accelerates vision-language models by using a lightweight vision adaptor to compress image tokens and integrating global features, achieving the first substantial speedup in this domain.

Authors:Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Title: ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Abstract:
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
Chinese: 视觉感知推测解码(ViSpec)是一种新颖框架,通过轻量级视觉适配器压缩图像标记并融合全局特征,首次实现了视觉语言模型推测解码的显著加速。
English: Vision-Aware Speculative Decoding (ViSpec) is a novel framework that accelerates vision-language models by using a lightweight vision adaptor to compress image tokens and integrating global features, achieving the first substantial speedup in this domain.

Authors:Abhishek Basu, Fahad Shamshad, Ashshak Sharifdeen, Karthik Nandakumar, Muhammad Haris Khan
Title: Calibration-Aware Prompt Learning for Medical Vision-Language Models
Abstract:
Medical Vision-Language Models (Med-VLMs) have demonstrated remarkable performance across diverse medical imaging tasks by leveraging large-scale image-text pretraining. However, their confidence calibration is largely unexplored, and so remains a significant challenge. As such, miscalibrated predictions can lead to overconfident errors, undermining clinical trust and decision-making reliability. To address this, we introduce CalibPrompt, the first framework to calibrate Med-VLMs during prompt tuning. CalibPrompt optimizes a small set of learnable prompts with carefully designed calibration objectives under scarce labeled data regime. First, we study a regularizer that attempts to align the smoothed accuracy with the predicted model confidences. Second, we introduce an angular separation loss to maximize textual feature proximity toward improving the reliability in confidence estimates of multimodal Med-VLMs. Extensive experiments on four publicly available Med-VLMs and five diverse medical imaging datasets reveal that CalibPrompt consistently improves calibration without drastically affecting clean accuracy. Our code is available at https://github.com/iabh1shekbasu/CalibPrompt.
中文: CalibPrompt是一种创新框架,通过在提示调优中优化医学视觉语言模型的置信度校准,提高了可靠性且不影响准确性。
English: CalibPrompt is a novel framework that enhances the confidence calibration of Medical Vision-Language Models during prompt tuning, improving reliability without significantly compromising accuracy.

Authors:Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi
Title: Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation
Abstract:
We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.
中文: VocAlign提出了一种针对开放词汇语义分割的无源域自适应框架,采用师生范式结合词汇对齐和LoRA微调,在CityScapes数据集上实现了6.11 mIoU提升,同时显著降低了计算和内存需求。
English: VocAlign introduces a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation, using a student-teacher paradigm with vocabulary alignment and LoRA fine-tuning to achieve a 6.11 mIoU improvement on CityScapes while reducing computational and memory demands.

Authors:Silvio Mazzucco, Carl Persson, Mattia Segu, Pier Luigi Dovesi, Federico Tombari, Luc Van Gool, Matteo Poggi
Title: Lost in Translation? Vocabulary Alignment for Source-Free Adaptation in Open-Vocabulary Semantic Segmentation
Abstract:
We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.
中文: VocAlign提出了一种针对开放词汇语义分割的无源域自适应框架,采用师生范式结合词汇对齐和LoRA微调,在CityScapes数据集上实现了6.11 mIoU提升,同时显著降低了计算和内存需求。
English: VocAlign introduces a source-free domain adaptation framework for VLMs in open-vocabulary semantic segmentation, using a student-teacher paradigm with vocabulary alignment and LoRA fine-tuning to achieve a 6.11 mIoU improvement on CityScapes while reducing computational and memory demands.

Authors:Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, Stefano Mattoccia
Title: Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Abstract:
Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either a vanilla one like Depth Anything v2 (DAv2), or deriving from it a novel recurrent architecture to infer depth from monocular event cameras. We evaluate our approach with synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.
中文: 本文提出一种跨模态蒸馏方法,利用视觉基础模型从事件数据生成密集代理深度标签,无需昂贵标注即可实现与全监督方法相媲美的单目深度估计性能,并达到当前最优水平。
English: This paper introduces a cross-modal distillation approach using Vision Foundation Models to generate dense proxy depth labels from event data, enabling competitive monocular depth estimation without costly annotations while achieving state-of-the-art results.

Authors:Zhaoyang Liu, Jingjing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, Shenglong Ye, Qingyun Li, Xuan Dong, Yue Yu, Chenyu Lu, YunXiang Mo, Yao Yan, Zeyue Tian, Xiao Zhang, Yuan Huang, Yiqian Liu, Weijie Su, Gen Luo, Xiangyu Yue, Biqing Qi, Kai Chen, Bowen Zhou, Yu Qiao, Qifeng Chen, Wenhai Wang
Title: ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Abstract:
Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.
中文:ScaleCUA推出了大规模数据集和模型,支持计算机使用代理跨平台操作,并在多项基准测试中创下最新性能记录。
English: ScaleCUA introduces a large-scale dataset and model for computer use agents, enabling cross-platform operation and achieving state-of-the-art performance across multiple benchmarks.

Authors:Fangjinhua Wang, Qingshan Xu, Yew-Soon Ong, Marc Pollefeys
Title: Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model
Abstract:
To reconstruct the 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse depth maps into a mesh or point cloud. To improve the computational efficiency, many methods initialize a coarse depth map and then gradually refine it in higher resolutions. Recently, diffusion models achieve great success in generation tasks. Starting from a random noise, diffusion models gradually recover the sample with an iterative denoising process. In this paper, we propose a novel MVS framework, which introduces diffusion models in MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining lightweight 2D U-Net and convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy to adaptively sample depth hypotheses based on the confidence estimated by diffusion model. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D. Code is available at: https://github.com/cvg/diffmvs.
Chinese: 本文提出了一种新颖的多视图立体框架,通过引入扩散模型将深度优化构建为条件去噪过程,所提出的DiffMVS和CasDiffMVS方法在运行效率和三维重建精度上均达到了业界领先水平。
English: This paper introduces a novel multi-view stereo framework that integrates diffusion models to refine depth estimation through a conditional denoising process, achieving state-of-the-art efficiency and performance with two proposed methods, DiffMVS and CasDiffMVS.

Authors:Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu
Title: LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
Abstract:
The problem of data contamination is now almost inevitable during the development of large language models (LLMs), with the training data commonly integrating those evaluation benchmarks even unintentionally. This problem subsequently makes it hard to benchmark LLMs fairly. Instead of constructing contamination-free datasets (quite hard), we propose a novel framework, \textbf{LNE-Blocking}, to restore model performance prior to contamination on potentially leaked datasets. Our framework consists of two components: contamination detection and disruption operation. For the prompt, the framework first uses the contamination detection method, \textbf{LNE}, to assess the extent of contamination in the model. Based on this, it adjusts the intensity of the disruption operation, \textbf{Blocking}, to elicit non-memorized responses from the model. Our framework is the first to efficiently restore the model's greedy decoding performance. This comes with a strong performance on multiple datasets with potential leakage risks, and it consistently achieves stable recovery results across different models and varying levels of data contamination. We release the code at https://github.com/RuijieH/LNE-Blocking to facilitate research.
中文: 本文提出LNE-Blocking新框架,通过检测大型语言模型中的数据污染并实施干扰操作,无需构建无污染数据集即可恢复模型原始性能。
English: The paper introduces LNE-Blocking, a novel framework that detects data contamination in large language models and applies disruption operations to restore their original performance without requiring contamination-free datasets.

Authors:Sreejato Chatterjee, Linh Tran, Quoc Duy Nguyen, Roni Kirson, Drue Hamlin, Harvest Aquino, Hanjia Lyu, Jiebo Luo, Timothy Dye
Title: Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Abstract:
Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/llm-oppression-benchmark).
中文: 本研究提出了一种利用大语言模型生成历史压迫情境敏感评分的新框架,提供了一种可扩展工具,能够捕捉不同地缘政治背景下基于身份的细微排斥现象。
English: This study introduces a novel framework using Large Language Models to generate context-sensitive scores of historical oppression, offering a scalable tool that captures nuanced identity-based exclusion across diverse geopolitical settings.

Authors:Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
Title: RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Abstract:
This paper presents RynnVLA-001, a vision-language-action(VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
中文: 本文提出RynnVLA-001视觉语言动作模型,采用结合视频生成与轨迹预测的双阶段预训练方法,并通过ActionVAE压缩动作表示,在机器人任务中实现了最优性能。
English: This paper introduces RynnVLA-001, a vision-language-action model that employs a novel two-stage pretraining approach combining video generation and trajectory prediction, enhanced by an ActionVAE for compact action representation, achieving state-of-the-art performance in robotics tasks.

Authors:Pierre Fernandez, Tomáš Souček, Nikola Jovanović, Hady Elsahar, Sylvestre-Alvise Rebuffi, Valeriu Lacatusu, Tuan Tran, Alexandre Mourachko
Title: Geometric Image Synchronization with Deep Watermarking
Abstract:
Synchronization is the task of estimating and inverting geometric transformations (e.g., crop, rotation) applied to an image. This work introduces SyncSeal, a bespoke watermarking method for robust image synchronization, which can be applied on top of existing watermarking methods to enhance their robustness against geometric transformations. It relies on an embedder network that imperceptibly alters images and an extractor network that predicts the geometric transformation to which the image was subjected. Both networks are end-to-end trained to minimize the error between the predicted and ground-truth parameters of the transformation, combined with a discriminator to maintain high perceptual quality. We experimentally validate our method on a wide variety of geometric and valuemetric transformations, demonstrating its effectiveness in accurately synchronizing images. We further show that our synchronization can effectively upgrade existing watermarking methods to withstand geometric transformations to which they were previously vulnerable.
中文: SyncSeal是一种定制水印方法,通过嵌入器和提取器网络预测并逆转几何变换,有效提升现有水印技术对抗此类干扰的鲁棒性,同时保持图像质量。
English: SyncSeal is a specialized watermarking technique that enhances the robustness of existing methods against geometric transformations by using embedder and extractor networks to predict and invert these changes while maintaining image quality.

Authors:Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
Title: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Abstract:
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule mirrors the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from baseline's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., GPQA, MMLU-Pro, and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
中文摘要:EVOL-RL是一种新颖的自改进框架,通过结合多数投票的稳定性和新颖性感知的探索,有效防止语言模型的熵崩溃,显著提升了领域内性能和跨领域泛化能力。
English Summary: EVOL-RL is a novel self-improvement framework that prevents entropy collapse in language models by combining majority-voted stability with novelty-aware exploration, significantly enhancing both in-domain performance and out-of-domain generalization.

Authors:Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau
Title: Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Abstract:
Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as \textit{grounding tokens}, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to the inability to fully integrate the cues in the text query (\textit{e.g.}, attributes, actions) for inference. Based on these insights, we propose a MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries for inquiring the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatial and temporal related visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy to assemble the predictions using the original video frames and the temporal-augmented frames as inputs to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
中文摘要:本文提出了一种基于多模态大语言模型的零样本时空视频定位框架,通过创新的解耦时空高亮和时间增强组装策略来提升模型推理能力,在多个基准测试中实现了最优性能。
English Summary: This paper introduces a zero-shot framework for spatio-temporal video grounding using multimodal large language models, employing novel decomposed spatio-temporal highlighting and temporal-augmented assembling strategies to enhance reasoning capabilities and achieve state-of-the-art performance on benchmarks.

Authors:Ali Nazari, Bardiya Kariminia, Mohsen Ebrahimi Moghaddam
Title: A Race Bias Free Face Aging Model for Reliable Kinship Verification
Abstract:
The age gap in kinship verification addresses the time difference between the photos of the parent and the child. Moreover, their same-age photos are often unavailable, and face aging models are racially biased, which impacts the likeness of photos. Therefore, we propose a face aging GAN model, RA-GAN, consisting of two new modules, RACEpSp and a feature mixer, to produce racially unbiased images. The unbiased synthesized photos are used in kinship verification to investigate the results of verifying same-age parent-child images. The experiments demonstrate that our RA-GAN outperforms SAM-GAN on an average of 13.14\% across all age groups, and CUSP-GAN in the 60+ age group by 9.1\% in terms of racial accuracy. Moreover, RA-GAN can preserve subjects' identities better than SAM-GAN and CUSP-GAN across all age groups. Additionally, we demonstrate that transforming parent and child images from the KinFaceW-I and KinFaceW-II datasets to the same age can enhance the verification accuracy across all age groups. The accuracy increases with our RA-GAN for the kinship relationships of father-son and father-daughter, mother-son, and mother-daughter, which are 5.22, 5.12, 1.63, and 0.41, respectively, on KinFaceW-I. Additionally, the accuracy for the relationships of father-daughter, father-son, and mother-son is 2.9, 0.39, and 1.6 on KinFaceW-II, respectively. The code is available at~\href{https://github.com/bardiya2254kariminia/An-Age-Transformation-whitout-racial-bias-for-Kinship-verification}{Github}
中文: 本研究提出了RA-GAN这一无种族偏见的面部老化模型,通过生成同龄亲子图像显著提升了亲属关系验证的准确性,并在身份特征保留方面优于现有技术。
English: The study introduces RA-GAN, a racially unbiased face aging model that enhances kinship verification by generating same-age parent-child images, achieving superior accuracy and identity preservation compared to existing methods.

Authors:Pak-Hei Yeung, Jayroop Ramesh, Pengfei Lyu, Ana Namburete, Jagath Rajapakse
Title: Semi-Supervised 3D Medical Segmentation from 2D Natural Images Pretrained Model
Abstract:
This paper explores the transfer of knowledge from general vision models pretrained on 2D natural images to improve 3D medical image segmentation. We focus on the semi-supervised setting, where only a few labeled 3D medical images are available, along with a large set of unlabeled images. To tackle this, we propose a model-agnostic framework that progressively distills knowledge from a 2D pretrained model to a 3D segmentation model trained from scratch. Our approach, M&N, involves iterative co-training of the two models using pseudo-masks generated by each other, along with our proposed learning rate guided sampling that adaptively adjusts the proportion of labeled and unlabeled data in each training batch to align with the models' prediction accuracy and stability, minimizing the adverse effect caused by inaccurate pseudo-masks. Extensive experiments on multiple publicly available datasets demonstrate that M&N achieves state-of-the-art performance, outperforming thirteen existing semi-supervised segmentation approaches under all different settings. Importantly, ablation studies show that M&N remains model-agnostic, allowing seamless integration with different architectures. This ensures its adaptability as more advanced models emerge. The code is available at https://github.com/pakheiyeung/M-N.
中文: 本文提出M&N框架,通过迭代协同训练和自适应数据采样,将2D预训练视觉模型的知识迁移至3D医学图像分割,在半监督设定下实现了最优性能。
English: This paper introduces M&N, a model-agnostic framework that transfers knowledge from 2D pretrained vision models to enhance 3D medical image segmentation through iterative co-training and adaptive data sampling, achieving state-of-the-art results in semi-supervised settings.

Authors:Shiwan Zhao, Xuyang Zhao, Jiaming Zhou, Aobo Kong, Qicheng Li, Yong Qin
Title: Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning
Abstract:
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem, where expert demonstrations come from a fixed behavior policy while training aims to optimize a target policy. Importance sampling is the standard tool for correcting this distribution mismatch, but large policy gaps lead to skewed weights, high variance, and unstable optimization. Existing methods mitigate this issue with KL penalties or clipping, which passively restrict updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap before training. For each problem, correct model-generated solutions are kept as on-policy data, while incorrect ones are rewritten through guided re-solving, falling back to expert demonstrations only when needed. This aligns the training distribution with the target policy, reducing variance and improving stability. To handle residual mismatch after rewriting, we additionally apply importance sampling during training, forming a two-stage approach that combines data-level alignment with lightweight optimization-level correction. Experiments on five mathematical reasoning benchmarks show consistent and significant gains over both vanilla SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. Data and code will be released at https://github.com/NKU-HLT/Off-Policy-SFT.
中文摘要:本文提出一种两阶段数据重写框架,在训练前主动缩小策略差距并在训练中应用重要性采样,在数学推理基准上相比现有方法实现了显著性能提升。
English Summary: This paper introduces a two-stage data rewriting framework that proactively reduces the policy gap before training and applies importance sampling during training, achieving significant performance improvements on mathematical reasoning benchmarks over existing methods.

Authors:Gengliang Li, Rongyu Chen, Bin Li, Linlin Yang, Guodong Ding
Title: MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation
Abstract:
Ensuring factual consistency and reliable reasoning remains a critical challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that integrates external knowledge grounding with reinforcement learning to improve the factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise; while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Codes are released at https://github.com/Garfieldgengliang/MEDFACT-R1.
中文: MEDFACT-R1通过整合外部知识基础与强化学习的两阶段框架,显著提升医学视觉语言模型的事实准确性,相比现有最优方法绝对改进幅度达22.5%。
English: MEDFACT-R1 is a two-stage framework combining external knowledge grounding with reinforcement learning to significantly enhance factual accuracy in medical vision-language models, achieving up to 22.5% improvement over prior methods.

Authors:Stelios Katsis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou
Title: Exploring How Audio Effects Alter Emotion with Foundation Models
Abstract:
Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models - large-scale neural architectures pretrained on multimodal data - can be leveraged to analyze these effects. Such models encode rich associations between musical structure, timbre, and affective meaning, offering a powerful framework for probing the emotional consequences of sound design techniques. By applying various probing methods to embeddings from deep learning models, we examine the complex, nonlinear relationships between audio FX and estimated emotion, uncovering patterns tied to specific effects and evaluating the robustness of foundation audio models. Our findings aim to advance understanding of the perceptual impact of audio production practices, with implications for music cognition, performance, and affective computing.
中文摘要:本研究探讨了如何利用基础模型分析音频效果在音乐中的情感影响,通过深度学习探针方法揭示了声音设计技术与情感反应之间的复杂关系。
English Summary: This study explores how foundation models can analyze the emotional impact of audio effects in music, revealing complex relationships between sound design techniques and affective responses through advanced deep learning probing methods.

Authors:Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong
Title: A1: Asynchronous Test-Time Scaling via Conformal Prediction
Abstract:
Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges, including severe synchronization overhead, memory bottlenecks, and latency, especially during speculative decoding with long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 refines arithmetic intensity to identify synchronization as the dominant bottleneck, proposes an online calibration strategy to enable asynchronous inference, and designs a three-stage rejection sampling pipeline that supports both sequential and parallel scaling. Through experiments on the MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model families, we demonstrate that A1 achieves a remarkable 56.7x speedup in test-time scaling and a 4.14x improvement in throughput, all while maintaining accurate rejection-rate control, reducing latency and memory overhead, and no accuracy loss compared to using target model scaling alone. These results position A1 as an efficient and principled solution for scalable LLM inference. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.
Chinese: ATTS作为一种异步测试时扩展框架,通过并行与序列双维度加速大语言模型,在保持准确性的同时实现高达56.7倍的加速比和4.14倍的吞吐量提升。
English: ATTS is an asynchronous test-time scaling framework that accelerates large language models by enabling parallel and sequential scaling, achieving up to 56.7x speedup and 4.14x throughput improvement without accuracy loss.

Authors:Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong
Title: ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
Abstract:
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency. Speculative decoding is a natural way to accelerate the scaling process; however, scaling along both the parallel and sequential dimensions poses significant challenges, including substantial memory-bound execution and synchronization overhead. We introduce ATTS (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive scaling framework that follows the hypothesis testing process to address these challenges. By revisiting arithmetic intensity, ATTS identifies synchronization as the primary bottleneck. It enables asynchronous inference through online calibration and proposes an ordinal classification algorithm that supports a three-stage rejection sampling pipeline, scaling along both the sequential and parallel axes. Across experiments on the MATH, AMC23, AIME24, and AIME25 datasets and across multiple draft-target model families, we show that ATTS delivers up to 56.7x speedup in test-time scaling and a 4.14x throughput improvement, while maintaining accurate control of the rejection rate, reducing latency and memory overhead, and incurring no accuracy loss. By scaling both in parallel and sequential dimensions, we enable the 1.5B/70B draft/target model combination to achieve the performance of the state-of-the-art reasoning model o3-mini (high) on the AIME dataset. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.
Chinese: ATTS作为一种异步测试时扩展框架,通过并行与序列双维度加速大语言模型,在保持准确性的同时实现高达56.7倍的加速比和4.14倍的吞吐量提升。
English: ATTS is an asynchronous test-time scaling framework that accelerates large language models by enabling parallel and sequential scaling, achieving up to 56.7x speedup and 4.14x throughput improvement without accuracy loss.

Authors:Yuxin Luo, Ruoyi Zhang, Lu-Chuan Liu, Tianyu Li, Hangyu Liu
Title: FCPE: A Fast Context-based Pitch Estimation Model
Abstract:
Pitch estimation (PE) in monophonic audio is crucial for MIDI transcription and singing voice conversion (SVC), but existing methods suffer significant performance degradation under noise. In this paper, we propose FCPE, a fast context-based pitch estimation model that employs a Lynx-Net architecture with depth-wise separable convolutions to effectively capture mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79\% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with the state-of-the-art methods. The Real-Time Factor (RTF) is 0.0062 on a single RTX 4090 GPU, which significantly outperforms existing algorithms in efficiency. Code is available at https://github.com/CNChTu/FCPE.
中文: 本文提出的FCPE模型采用基于Lynx-Net的深度可分离卷积架构,在保持高鲁棒性和低计算成本的同时,实现了与最优方法相当的96.79%原始音高准确率,并以0.0062的实时因子显著超越现有算法的效率。
English: The proposed FCPE model utilizes a Lynx-Net with depth-wise separable convolutions to achieve robust pitch estimation with high noise tolerance and computational efficiency, matching state-of-the-art accuracy at 96.79% RPA while significantly outperforming existing methods in speed with an RTF of 0.0062.

Authors:Dan Zhang, Min Cai, Jonathan Li, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang
Title: TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference
Abstract:
Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences during training. This temporal-difference (TD) regularization produces smooth rewards and improves alignment with long-term objectives. Incorporating TDRM into the actor-critic style online RL loop yields consistent empirical gains. It is worth noting that TDRM is a supplement to verifiable reward methods, and both can be used in series. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL -- achieving comparable performance with just 2.5k data to what baseline methods require 50.1k data to attain -- and yield higher-quality language model policies on 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1,5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.
中文: TDRM通过最小化时序差异提升奖励模型的稳定性,在Best-of-N和树搜索任务中表现更优,并以仅需2.5k数据实现基线方法50.1k数据的效果,显著提高了多款语言模型的强化学习效率。
English: TDRM enhances reward model consistency by minimizing temporal differences, improving performance in Best-of-N and tree-search tasks while enabling more data-efficient reinforcement learning across multiple language models.

Authors:Dan Zhang, Min Cai, Jonathan Light, Ziniu Hu, Yisong Yue, Jie Tang
Title: TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference
Abstract:
Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, leading to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences (TD) for training-time reinforcement learning and inference-time verification. Experiments show that TD-trained process reward models (PRMs) improve performance across Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs lead to more data-efficient RL -- achieving comparable performance with just 2.5k data to what baseline methods require 50.1k data to attain -- and yield higher-quality language model policies in 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1,5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.
中文: TDRM通过最小化时序差异提升奖励模型的稳定性,在Best-of-N和树搜索任务中表现更优,并以仅需2.5k数据实现基线方法50.1k数据的效果,显著提高了多款语言模型的强化学习效率。
English: TDRM enhances reward model consistency by minimizing temporal differences, improving performance in Best-of-N and tree-search tasks while enabling more data-efficient reinforcement learning across multiple language models.

Authors:Liran Nochumsohn, Raz Marshanski, Hedi Zisling, Omri Azencot
Title: Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting
Abstract:
Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, We introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear matches state-of-the-art performance while offering superior efficiency, robustness to various sampling rates, and enhanced interpretability. The implementation of Super-Linear is available at \href{https://github.com/azencot-group/SuperLinear}{https://github.com/azencot-group/SuperLinear}
中文: Super-Linear 是一种轻量级、可扩展的专家混合模型,通过频率专用线性专家和谱门控机制取代复杂架构,在时间序列预测中实现了与顶尖模型相当的性能,同时具备更高的效率、鲁棒性和可解释性。
English: Super-Linear is a lightweight and scalable mixture-of-experts model that replaces complex architectures with frequency-specialized linear experts and a spectral gating mechanism, achieving state-of-the-art performance with superior efficiency, robustness, and interpretability in time series forecasting.

Authors:Hongyao Tu, Liang Zhang, Yujie Lin, Xin Lin, Haibo Zhang, Long Zhang, Jinsong Su
Title: LLM-OREF: An Open Relation Extraction Framework Based on Large Language Models
Abstract:
The goal of open relation extraction (OpenRE) is to develop an RE model that can generalize to new relations not encountered during training. Existing studies primarily formulate OpenRE as a clustering task. They first cluster all test instances based on the similarity between the instances, and then manually assign a new relation to each cluster. However, their reliance on human annotation limits their practicality. In this paper, we propose an OpenRE framework based on large language models (LLMs), which directly predicts new relations for test instances by leveraging their strong language understanding and generation abilities, without human intervention. Specifically, our framework consists of two core components: (1) a relation discoverer (RD), designed to predict new relations for test instances based on \textit{demonstrations} formed by training instances with known relations; and (2) a relation predictor (RP), used to select the most likely relation for a test instance from $n$ candidate relations, guided by \textit{demonstrations} composed of their instances. To enhance the ability of our framework to predict new relations, we design a self-correcting inference strategy composed of three stages: relation discovery, relation denoising, and relation prediction. In the first stage, we use RD to preliminarily predict new relations for all test instances. Next, we apply RP to select some high-reliability test instances for each new relation from the prediction results of RD through a cross-validation method. During the third stage, we employ RP to re-predict the relations of all test instances based on the demonstrations constructed from these reliable test instances. Extensive experiments on three OpenRE datasets demonstrate the effectiveness of our framework. We release our code at https://github.com/XMUDeepLIT/LLM-OREF.git.
中文: 本文提出了一种基于大语言模型的开放关系抽取框架,通过包含关系发现、去噪和预测的自校正推理策略,无需人工干预即可自动预测新关系。
English: This paper introduces a novel open relation extraction framework using large language models that autonomously predicts new relations through a self-correcting inference process, eliminating the need for human annotation.

Authors:Lukas Silvester Barth, Paulo von Petersenn
Title: Probabilistic and nonlinear compressive sensing
Abstract:
We present a smooth probabilistic reformulation of $\ell_0$ regularized regression that does not require Monte Carlo sampling and allows for the computation of exact gradients, facilitating rapid convergence to local optima of the best subset selection problem. The method drastically improves convergence speed compared to similar Monte Carlo based approaches. Furthermore, we empirically demonstrate that it outperforms compressive sensing algorithms such as IHT and (Relaxed-) Lasso across a wide range of settings and signal-to-noise ratios. The implementation runs efficiently on both CPUs and GPUs and is freely available at https://github.com/L0-and-behold/probabilistic-nonlinear-cs. We also contribute to research on nonlinear generalizations of compressive sensing by investigating when parameter recovery of a nonlinear teacher network is possible through compression of a student network. Building upon theorems of Fefferman and Markel, we show theoretically that the global optimum in the infinite-data limit enforces recovery up to certain symmetries. For empirical validation, we implement a normal-form algorithm that selects a canonical representative within each symmetry class. However, while compression can help to improve test loss, we find that exact parameter recovery is not even possible up to symmetries. In particular, we observe a surprising rebound effect where teacher and student configurations initially converge but subsequently diverge despite continuous decrease in test loss. These findings indicate fundamental differences between linear and nonlinear compressive sensing.
中文: 本文提出了一种平滑的概率化ℓ₀正则回归方法,无需蒙特卡洛采样即可计算精确梯度并实现快速收敛,在多种实验设置下均优于IHT和Lasso等压缩感知算法;同时在线性压缩感知的拓展研究中发现,非线性场景下的参数恢复存在对称性约束和测试损失下降时参数反而发散的反弹现象。
English: This paper introduces a smooth probabilistic method for ℓ₀ regularized regression that enables exact gradient computation and faster convergence than Monte Carlo approaches, outperforming compressive sensing algorithms like IHT and Lasso across various settings while also exploring nonlinear compressive sensing where parameter recovery faces symmetry challenges and unexpected divergence despite decreasing test loss.

Authors:Chau Pham, Quan Dao, Mahesh Bhosale, Yunjie Tian, Dimitris Metaxas, David Doermann
Title: AutoEdit: Automatic Hyperparameter Tuning for Image Editing
Abstract:
Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To get the reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification. This process incurs high computational costs due to the huge hyperparameter search space. We consider searching optimal editing's hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of a diffusion-based image editing framework in the real world. Codes can be found at https://github.com/chaupham1709/AutoEdit.git.
中文摘要:该研究提出了一种强化学习框架,在扩散去噪过程中动态优化图像编辑的超参数,相比暴力搜索方法显著降低了计算成本和搜索时间。
English Summary: The study introduces a reinforcement learning framework that dynamically optimizes hyperparameters in diffusion-based image editing, significantly reducing computational costs and search time compared to brute-force methods.

Authors:Shenghao Zhu, Yifei Chen, Weihong Chen, Shuo Jiang, Guanyu Zhou, Yuanhan Wang, Feiwei Qin, Changmiao Wang, Qiyuan Tian
Title: No Modality Left Behind: Adapting to Missing Modalities via Knowledge Distillation for Brain Tumor Segmentation
Abstract:
Accurate brain tumor segmentation is essential for preoperative evaluation and personalized treatment. Multi-modal MRI is widely used due to its ability to capture complementary tumor features across different sequences. However, in clinical practice, missing modalities are common, limiting the robustness and generalizability of existing deep learning methods that rely on complete inputs, especially under non-dominant modality combinations. To address this, we propose AdaMM, a multi-modal brain tumor segmentation framework tailored for missing-modality scenarios, centered on knowledge distillation and composed of three synergistic modules. The Graph-guided Adaptive Refinement Module explicitly models semantic associations between generalizable and modality-specific features, enhancing adaptability to modality absence. The Bi-Bottleneck Distillation Module transfers structural and textural knowledge from teacher to student models via global style matching and adversarial feature alignment. The Lesion-Presence-Guided Reliability Module predicts prior probabilities of lesion types through an auxiliary classification task, effectively suppressing false positives under incomplete inputs. Extensive experiments on the BraTS 2018 and 2024 datasets demonstrate that AdaMM consistently outperforms existing methods, exhibiting superior segmentation accuracy and robustness, particularly in single-modality and weak-modality configurations. In addition, we conduct a systematic evaluation of six categories of missing-modality strategies, confirming the superiority of knowledge distillation and offering practical guidance for method selection and future research. Our source code is available at https://github.com/Quanato607/AdaMM.
中文摘要:提出的AdaMM框架通过知识蒸馏和三个协同模块,有效解决了多模态MRI中模态缺失的脑肿瘤分割问题,在BraTS数据集上相比现有方法展现出更优的分割精度和鲁棒性。
English Summary: The proposed AdaMM framework enhances brain tumor segmentation in missing-modality MRI scenarios through knowledge distillation and three synergistic modules, demonstrating superior accuracy and robustness on BraTS datasets compared to existing methods.

Authors:Facundo Domínguez, Arnaud Spiwack
Title: Refinement-Types Driven Development: A study
Abstract:
This paper advocates for the broader application of SMT solvers in everyday programming, challenging the conventional wisdom that these tools are solely for formal methods and verification. We claim that SMT solvers, when seamlessly integrated into a compiler's static checks, significantly enhance the capabilities of ordinary type checkers in program composition. Specifically, we argue that refinement types, as embodied by Liquid Haskell, enable the use of SMT solvers in mundane programming tasks. Through a case study on handling binder scopes in compilers, we envision a future where ordinary programming is made simpler and more enjoyable with the aid of refinement types and SMT solvers. As a secondary contribution, we present a prototype implementation of a theory of finite maps for Liquid Haskell's solver, developed to support our case study.
中文摘要:本文主张通过精化类型(如Liquid Haskell)将SMT求解器融入日常编程,能强化类型检查并简化编译器绑定作用域处理等常规任务。
English Summary: This paper promotes integrating SMT solvers into daily programming through refinement types like Liquid Haskell, enhancing type checking and simplifying tasks such as compiler binder scope management.

Authors:Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng, Yanyuan Qiao, Imran Razzak, Yutong Xie
Title: A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making
Abstract:
Medical decision-making often involves integrating knowledge from multiple clinical specialties, typically achieved through multidisciplinary teams. Inspired by this collaborative process, recent work has leveraged large language models (LLMs) in multi-agent collaboration frameworks to emulate expert teamwork. While these approaches improve reasoning through agent interaction, they are limited by static, pre-assigned roles, which hinder adaptability and dynamic knowledge integration. To address these limitations, we propose KAMAC, a Knowledge-driven Adaptive Multi-Agent Collaboration framework that enables LLM agents to dynamically form and expand expert teams based on the evolving diagnostic context. KAMAC begins with one or more expert agents and then conducts a knowledge-driven discussion to identify and fill knowledge gaps by recruiting additional specialists as needed. This supports flexible, scalable collaboration in complex clinical scenarios, with decisions finalized through reviewing updated agent comments. Experiments on two real-world medical benchmarks demonstrate that KAMAC significantly outperforms both single-agent and advanced multi-agent methods, particularly in complex clinical scenarios (i.e., cancer prognosis) requiring dynamic, cross-specialty expertise. Our code is publicly available at: https://github.com/XiaoXiao-Woo/KAMAC.
中文: 本文提出KAMAC知识驱动自适应多智能体协作框架,通过动态组建专家团队克服静态角色分配局限,在癌症预后等复杂医疗场景中显著优于现有方法。
English: This paper introduces KAMAC, a Knowledge-driven Adaptive Multi-Agent Collaboration framework that dynamically forms expert teams using large language models to address limitations in static role assignments, significantly outperforming existing methods in complex medical scenarios like cancer prognosis.

Authors:Chaoyin She, Ruifang Lu, Lida Chen, Wei Wang, Qinghua Huang
Title: EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Abstract:
Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.
中文: EchoVLM是一种专为超声医学影像设计的视觉语言模型,采用混合专家架构,在七个解剖区域数据上训练,显著提升了超声报告生成和多任务诊断性能,为临床应用提供了可行的技术解决方案。
English: EchoVLM is a specialized vision-language model using a Mixture of Experts architecture that significantly improves ultrasound report generation and multi-task diagnostics across seven anatomical regions, enhancing diagnostic accuracy for clinical applications.

Authors:Xingwu Zhang, Guanxuan Li, Zhuocheng Zhang, Zijun Long
Title: RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching
Abstract:
The rapidly growing number of product categories in large-scale e-commerce makes accurate object identification for automated packing in warehouses substantially more difficult. As the catalog grows, intra-class variability and a long tail of rare or visually similar items increase, and when combined with diverse packaging, cluttered containers, frequent occlusion, and large viewpoint changes-these factors amplify discrepancies between query and reference images, causing sharp performance drops for methods that rely solely on 2D appearance features. Thus, we propose RoboEye, a two-stage identification framework that dynamically augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters to bridge training deployment gaps. In the first stage, we train a large vision model to extract 2D features for generating candidate rankings. A lightweight 3D-feature-awareness module then estimates 3D feature quality and predicts whether 3D re-ranking is necessary, preventing performance degradation and avoiding unnecessary computation. When invoked, the second stage uses our robot 3D retrieval transformer, comprising a 3D feature extractor that produces geometry-aware dense features and a keypoint-based matcher that computes keypoint-correspondence confidences between query and reference images instead of conventional cosine-similarity scoring. Experiments show that RoboEye improves Recall@1 by 7.1% over the prior state of the art (RoboLLM). Moreover, RoboEye operates using only RGB images, avoiding reliance on explicit 3D inputs and reducing deployment costs. The code used in this paper is publicly available at: https://github.com/longkukuhi/RoboEye.
中文: RoboEye是一个两阶段识别框架,通过结合3D推理增强2D语义特征,提升电商仓库中的物体识别准确率,在仅使用RGB图像降低成本的同时,将Recall@1指标较现有最佳方法提高了7.1%。
English: RoboEye is a two-stage identification framework that enhances 2D semantic features with 3D reasoning to improve object recognition in e-commerce warehouses, achieving a 7.1% increase in Recall@1 over previous methods while using only RGB images to reduce costs.

Authors:Zhuokang Shen, Kaisen Zhang, Bohan Jia, Yuan Fang, Zhou Yu, Shaohui Lin
Title: DF-LLaVA: Unlocking MLLM's potential for Synthetic Image Detection via Prompt-Guided Knowledge Injection
Abstract:
With the increasing prevalence of synthetic images, evaluating image authenticity and locating forgeries accurately while maintaining human interpretability remains a challenging task. Existing detection models primarily focus on simple authenticity classification, ultimately providing only a forgery probability or binary judgment, which offers limited explanatory insights into image authenticity. Moreover, while MLLM-based detection methods can provide more interpretable results, they still lag behind expert models in terms of pure authenticity classification accuracy. To address this, we propose DF-LLaVA, a simple yet effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first extracts latent knowledge from MLLMs and then injects it into training via prompts. This framework allows LLaVA to achieve outstanding detection accuracy exceeding expert models while still maintaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of our DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available online at: https://github.com/Eliot-Shen/DF-LLaVA.
中文: DF-LLaVA通过提取多模态大模型的潜在知识并结合提示训练,开发了一个既能超越专家模型检测精度、又能保持可解释性的合成图像检测框架。
English: DF-LLaVA is a novel framework that enhances MLLMs' discrimination ability through latent knowledge extraction and prompt-based training, achieving superior detection accuracy and interpretability in synthetic image detection.

Authors:Bingsong Bai, Qihang Lu, Wenbing Yang, Zihan Sun, Yueran Hou, Peilei Jia, Songbai Pu, Ruibo Fu, Yingming Gao, Ya Li, Jun Gao
Title: SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding
Abstract:
Paralinguistic sounds, like laughter and sighs, are crucial for synthesizing more realistic and engaging speech. However, existing methods typically depend on proprietary datasets, while publicly available resources often suffer from incomplete speech, inaccurate or missing timestamps, and limited real-world relevance. To address these problems, we propose an automated framework for generating large-scale paralinguistic data and apply it to construct the SynParaSpeech dataset. The dataset comprises 6 paralinguistic categories with 118.75 hours of data and precise timestamps, all derived from natural conversational speech. Our contributions lie in introducing the first automated method for constructing large-scale paralinguistic datasets and releasing the SynParaSpeech corpus, which advances speech generation through more natural paralinguistic synthesis and enhances speech understanding by improving paralinguistic event detection. The dataset and audio samples are available at https://github.com/ShawnPi233/SynParaSpeech.
Chinese: 本文提出了一种自动构建副语言数据集的框架,并发布了SynParaSpeech语料库,该库包含精确时间戳的大规模副语言声音,旨在提升语音合成的自然度和副语言事件检测的准确性。
English: This paper introduces an automated framework for creating the SynParaSpeech dataset, which provides a large-scale collection of paralinguistic sounds with precise timestamps to enhance both speech synthesis and understanding.

Authors:Bingsong Bai, Qihang Lu, Wenbing Yang, Zihan Sun, Yueran Hou, Peilei Jia, Songbai Pu, Ruibo Fu, Yingming Gao, Ya Li, Jun Gao
Title: SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding
Abstract:
Paralinguistic sounds, like laughter and sighs, are crucial for synthesizing more realistic and engaging speech. However, existing methods typically depend on proprietary datasets, while publicly available resources often suffer from incomplete speech, inaccurate or missing timestamps, and limited real-world relevance. To address these problems, we propose an automated framework for generating large-scale paralinguistic data and apply it to construct the SynParaSpeech dataset. The dataset comprises 6 paralinguistic categories with 118.75 hours of data and precise timestamps, all derived from natural conversational speech. Our contributions lie in introducing the first automated method for constructing large-scale paralinguistic datasets and releasing the SynParaSpeech corpus, which advances speech generation through more natural paralinguistic synthesis and enhances speech understanding by improving paralinguistic event detection. The dataset and audio samples are available at https://github.com/ShawnPi233/SynParaSpeech.
Chinese: 本文提出了一种自动构建副语言数据集的框架,并发布了SynParaSpeech语料库,该库包含精确时间戳的大规模副语言声音,旨在提升语音合成的自然度和副语言事件检测的准确性。
English: This paper introduces an automated framework for creating the SynParaSpeech dataset, which provides a large-scale collection of paralinguistic sounds with precise timestamps to enhance both speech synthesis and understanding.

Authors:Kangdi Wang, Zhiyue Wu, Dinghao Zhou, Rui Lin, Junyu Dai, Tao Jiang
Title: Back to Ear: Perceptually Driven High Fidelity Music Reconstruction
Abstract:
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose εar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. (ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss using its derivatives--Instantaneous Frequency and Group Delay--for precision. (iii) A new spectral supervision paradigm where magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the LR components. Experiments show εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and the spatial characteristics.
中文摘要:εar-VAE模型通过引入K加权感知滤波器、创新的相位损失函数和频谱监督范式,优化了音频重建的听觉感知效果,在重构高频谐波和空间特征方面显著优于现有开源模型。
English Summary: The εar-VAE model enhances audio reconstruction by incorporating auditory perception through K-weighting filters, novel phase losses for stereo coherence, and a spectral supervision paradigm, significantly outperforming existing models in high-frequency harmonics and spatial accuracy.

Authors:Keanu Sisouk, Eloi Tanguy, Julie Delon, Julien Tierny
Title: Robust Barycenters of Persistence Diagrams
Abstract:
This short paper presents a general approach for computing robust Wasserstein barycenters of persistence diagrams. The classical method consists in computing assignment arithmetic means after finding the optimal transport plans between the barycenter and the persistence diagrams. However, this procedure only works for the transportation cost related to the $q$-Wasserstein distance $W_q$ when $q=2$. We adapt an alternative fixed-point method to compute a barycenter diagram for generic transportation costs ($q > 1$), in particular those robust to outliers, $q \in (1,2)$. We show the utility of our work in two applications: \emph{(i)} the clustering of persistence diagrams on their metric space and \emph{(ii)} the dictionary encoding of persistence diagrams. In both scenarios, we demonstrate the added robustness to outliers provided by our generalized framework. Our Python implementation is available at this address: https://github.com/Keanu-Sisouk/RobustBarycenter .
中文: 本文提出了一种鲁棒的持续性图谱Wasserstein重心计算方法,突破传统q=2的限制,可处理任意q>1的传输成本,特别在聚类和字典编码应用中显著提升了针对异常值的稳健性。
English: This paper introduces a robust method for computing Wasserstein barycenters of persistence diagrams that extends beyond the classical q=2 case to handle generic transportation costs (q>1), particularly enhancing outlier robustness in clustering and dictionary encoding applications.

Authors:Jonas Geiger, Marta Moscati, Shah Nawaz, Markus Schedl
Title: Music4All A+A: A Multimodal Dataset for Music Information Retrieval Tasks
Abstract:
Music is characterized by aspects related to different modalities, such as the audio signal, the lyrics, or the music video clips. This has motivated the development of multimodal datasets and methods for Music Information Retrieval (MIR) tasks such as genre classification or autotagging. Music can be described at different levels of granularity, for instance defining genres at the level of artists or music albums. However, most datasets for multimodal MIR neglect this aspect and provide data at the level of individual music tracks. We aim to fill this gap by providing Music4All Artist and Album (Music4All A+A), a dataset for multimodal MIR tasks based on music artists and albums. Music4All A+A is built on top of the Music4All-Onion dataset, an existing track-level dataset for MIR tasks. Music4All A+A provides metadata, genre labels, image representations, and textual descriptors for 6,741 artists and 19,511 albums. Furthermore, since Music4All A+A is built on top of Music4All-Onion, it allows access to other multimodal data at the track level, including user--item interaction data. This renders Music4All A+A suitable for a broad range of MIR tasks, including multimodal music recommendation, at several levels of granularity. To showcase the use of Music4All A+A, we carry out experiments on multimodal genre classification of artists and albums, including an analysis in missing-modality scenarios, and a quantitative comparison with genre classification in the movie domain. Our experiments show that images are more informative for classifying the genres of artists and albums, and that several multimodal models for genre classification struggle in generalizing across domains. We provide the code to reproduce our experiments at https://github.com/hcai-mms/Music4All-A-A, the dataset is linked in the repository and provided open-source under a CC BY-NC-SA 4.0 license.
Chinese Summary: Music4All A+A数据集填补了多模态音乐信息检索中缺乏艺术家和专辑级别数据的空白,支持跨粒度的流派分类和推荐等任务。
English Summary: The Music4All A+A dataset addresses the gap in multimodal Music Information Retrieval by providing artist and album-level data, enabling tasks like genre classification and recommendation across different granularities.

Authors:Qianyang Li, Xingjun Zhang, Shaoxun Wang, Jia Wei
Title: DPANet: Dual Pyramid Attention Network for Multivariate Time Series Forecasting
Abstract:
Long-term time series forecasting (LTSF) is hampered by the challenge of modeling complex dependencies that span multiple temporal scales and frequency resolutions. Existing methods, including Transformer and MLP-based models, often struggle to capture these intertwined characteristics in a unified and structured manner. We propose the Dual Pyramid Attention Network (DPANet), a novel architecture that explicitly decouples and concurrently models temporal multi-scale dynamics and spectral multi-resolution periodicities. DPANet constructs two parallel pyramids: a Temporal Pyramid built on progressive downsampling, and a Frequency Pyramid built on band-pass filtering. The core of our model is the Cross-Pyramid Fusion Block, which facilitates deep, interactive information exchange between corresponding pyramid levels via cross-attention. This fusion proceeds in a coarse-to-fine hierarchy, enabling global context to guide local representation learning. Extensive experiments on public benchmarks show that DPANet achieves state-of-the-art performance, significantly outperforming prior models. Code is available at https://github.com/hit636/DPANet.
中文: DPANet提出了一种双金字塔架构,通过时序金字塔和频域金字塔结合跨注意力融合,有效建模多尺度动态和周期性,在长期时间序列预测中实现了最优性能。
English: DPANet introduces a dual pyramid architecture with temporal and frequency pyramids, integrated through cross-attention fusion, to effectively model multi-scale dynamics and periodicities, achieving state-of-the-art performance in long-term time series forecasting.

Authors:Duojia Li, Shenghui Lu, Hongchen Pan, Zongyi Zhan, Qingyang Hong, Lin Li
Title: MeanFlowSE: one-step generative speech enhancement via conditional mean flow
Abstract:
Multistep inference is a bottleneck for real-time generative speech enhancement because flow- and diffusion-based systems learn an instantaneous velocity field and therefore rely on iterative ordinary differential equation (ODE) solvers. We introduce MeanFlowSE, a conditional generative model that learns the average velocity over finite intervals along a trajectory. Using a Jacobian-vector product (JVP) to instantiate the MeanFlow identity, we derive a local training objective that directly supervises finite-interval displacement while remaining consistent with the instantaneous-field constraint on the diagonal. At inference, MeanFlowSE performs single-step generation via a backward-in-time displacement, removing the need for multistep solvers; an optional few-step variant offers additional refinement. On VoiceBank-DEMAND, the single-step model achieves strong intelligibility, fidelity, and perceptual quality with substantially lower computational cost than multistep baselines. The method requires no knowledge distillation or external teachers, providing an efficient, high-fidelity framework for real-time generative speech enhancement. The proposed method is open-sourced at https://github.com/liduojia1/MeanFlowSE.
中文:MeanFlowSE是一种通过学习有限间隔速度场实现单步语音增强的生成模型,无需多步求解器即可保持高质量和计算效率。
English: MeanFlowSE is a generative model that enables single-step speech enhancement by learning finite-interval velocity fields, eliminating the need for multistep solvers while maintaining high quality and computational efficiency.

Authors:Qidong Wang, Junjie Hu, Ming Jiang
Title: V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Abstract:
Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights on multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines Visual Semantic Editing and Attention Modulating for causal interpretation of VLMs. V-SEAM enables concept-level visual manipulations and identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships. We observe that positive heads are often shared within the same semantic level but vary across levels, while negative heads tend to generalize broadly. Finally, we introduce an automatic method to modulate key head embeddings, demonstrating enhanced performance for both LLaVA and InstructBLIP across three diverse VQA benchmarks. Our data and code are released at: https://github.com/petergit1/V-SEAM.
V-SEAM is a novel framework that enables concept-level visual interventions and attention analysis to causally interpret vision-language models, demonstrating improved performance across multiple VQA benchmarks through targeted attention modulation.
English Summary:

Authors:Humphrey Munn, Brendan Tidd, Peter Böhm, Marcus Gallagher, David Howard
Title: Scalable Multi-Objective Robot Reinforcement Learning through Gradient Conflict Resolution
Abstract:
Reinforcement Learning (RL) robot controllers usually aggregate many task objectives into one scalar reward. While large-scale proximal policy optimisation (PPO) has enabled impressive results such as robust robot locomotion in the real world, many tasks still require careful reward tuning and are brittle to local optima. Tuning cost and sub-optimality grow with the number of objectives, limiting scalability. Modelling reward vectors and their trade-offs can address these issues; however, multi-objective methods remain underused in RL for robotics because of computational cost and optimisation difficulty. In this work, we investigate the conflict between gradient contributions for each objective that emerge from scalarising the task objectives. In particular, we explicitly address the conflict between task-based rewards and terms that regularise the policy towards realistic behaviour. We propose GCR-PPO, a modification to actor-critic optimisation that decomposes the actor update into objective-wise gradients using a multi-headed critic and resolves conflicts based on the objective priority. Our methodology, GCR-PPO, is evaluated on the well-known IsaacLab manipulation and locomotion benchmarks and additional multi-objective modifications on two related tasks. We show superior scalability compared to parallel PPO (p = 0.04), without significant computational overhead. We also show higher performance with more conflicting tasks. GCR-PPO improves on large-scale PPO with an average improvement of 9.5%, with high-conflict tasks observing a greater improvement. The code is available at https://github.com/humphreymunn/GCR-PPO.
中文: 本文提出GCR-PPO方法,通过分解目标梯度并按其优先级解决冲突,在机器人任务中相比标准PPO展现出更优的性能和扩展性。
English: This paper introduces GCR-PPO, a modified reinforcement learning method that resolves conflicts between task objectives by decomposing gradients and prioritizing them, demonstrating superior performance and scalability in robotics tasks compared to standard PPO.

Authors:Hannah Sterz, Fabian David Schmidt, Goran Glavaš, Ivan Vulić
Title: ReCoVeR the Target Language: Language Steering without Sacrificing Task Performance
Abstract:
As they become increasingly multilingual, Large Language Models (LLMs) exhibit more language confusion, i.e., they tend to generate answers in a language different from the language of the prompt or the answer language explicitly requested by the user. In this work, we propose ReCoVeR (REducing language COnfusion in VEctor Representations), a novel lightweight approach for reducing language confusion based on language-specific steering vectors. We first isolate language vectors with the help of multi-parallel corpus and then effectively leverage those vectors for effective LLM steering via fixed (i.e., unsupervised) as well as trainable steering functions. Our extensive evaluation, encompassing three benchmarks and 18 languages, shows that ReCoVeR effectively mitigates language confusion in both monolingual and cross-lingual setups while at the same time -- and in contrast to prior language steering methods -- retaining task performance. Our data code is available at https://github.com/hSterz/recover.
中文摘要:ReCoVeR是一种基于语言特定导向向量的轻量级方法,能有效减少多语言大模型中的语言混淆问题,同时保持任务性能。
English Summary: ReCoVeR is a lightweight method using language-specific steering vectors to effectively reduce language confusion in multilingual Large Language Models while maintaining task performance.

Authors:Yuanyuan Yao, Simon Geirnaert, Tinne Tuytelaars, Alexander Bertrand
Title: Efficient Solutions for Mitigating Initialization Bias in Unsupervised Self-Adaptive Auditory Attention Decoding
Abstract:
Decoding the attended speaker in a multi-speaker environment from electroencephalography (EEG) has attracted growing interest in recent years, with neuro-steered hearing devices as a driver application. Current approaches typically rely on ground-truth labels of the attended speaker during training, necessitating calibration sessions for each user and each EEG set-up to achieve optimal performance. While unsupervised self-adaptive auditory attention decoding (AAD) for stimulus reconstruction has been developed to eliminate the need for labeled data, it suffers from an initialization bias that can compromise performance. Although an unbiased variant has been proposed to address this limitation, it introduces substantial computational complexity that scales with data size. This paper presents three computationally efficient alternatives that achieve comparable performance, but with a significantly lower and constant computational cost. The code for the proposed algorithms is available at https://github.com/YYao-42/Unsupervised_AAD.
Chinese: 本文提出了三种计算效率高的无监督听觉注意解码替代方案,它们在性能上与现有方法相当,但计算成本显著降低且保持恒定,解决了以往方法中初始化偏差和高复杂度的局限性。
English: This paper introduces three computationally efficient alternatives for unsupervised auditory attention decoding that achieve comparable performance to existing methods but with significantly lower and constant computational cost, addressing the limitations of initialization bias and high complexity in previous approaches.

Authors:Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng
Title: Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
Abstract:
Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.
中文摘要:本研究提出Align3方法,通过测试时审议帮助大语言模型适应不同场景下的动态行为与安全规范,并推出SpecBench基准,证明该方法能以最小成本有效提升规范对齐能力。
English Summary: The study introduces Align3, a lightweight method using test-time deliberation to help large language models adapt to dynamic behavioral and safety specifications across various scenarios, and presents SpecBench, a benchmark demonstrating its effectiveness in improving specification alignment with minimal overhead.

Authors:Shangrong Wu, Yanghong Zhou, Yang Chen, Feng Zhang, P. Y. Mok
Title: Chain-of-Thought Re-ranking for Image Retrieval Tasks
Abstract:
Image retrieval remains a fundamental yet challenging problem in computer vision. While recent advances in Multimodal Large Language Models (MLLMs) have demonstrated strong reasoning capabilities, existing methods typically employ them only for evaluation, without involving them directly in the ranking process. As a result, their rich multimodal reasoning abilities remain underutilized, leading to suboptimal performance. In this paper, we propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address this issue. Specifically, we design a listwise ranking prompt that enables MLLM to directly participate in re-ranking candidate images. This ranking process is grounded in an image evaluation prompt, which assesses how well each candidate aligns with users query. By allowing MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making - all of which are essential for accurate image retrieval. To enable structured and fine-grained analysis, we further introduce a query deconstruction prompt, which breaks down the original query into multiple semantic components. Extensive experiments on five datasets demonstrate the effectiveness of our CoTRR method, which achieves state-of-the-art performance across three image retrieval tasks, including text-to-image retrieval (TIR), composed image retrieval (CIR) and chat-based image retrieval (Chat-IR). Our code is available at https://github.com/freshfish15/CoTRR .
Chinese: 本文提出链式思维重排序方法,通过多模态大语言模型直接参与图像重排序,结合列表推理和查询解构技术,在多项图像检索任务中实现了最优性能。
English: The paper introduces Chain-of-Thought Re-Ranking (CoTRR), a novel method that leverages Multimodal Large Language Models to directly re-rank images through listwise reasoning and query deconstruction, achieving state-of-the-art results in multiple image retrieval tasks.

Authors:Pengyu Wang, Shaojun Zhou, Chenkun Tan, Xinghao Wang, Wei Huang, Zhen Ye, Zhaowei Li, Botian Jiang, Dong Zhang, Xipeng Qiu
Title: UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets
Abstract:
Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation, powering applications such as visual question answering and text-guided image synthesis. However, progress in unified VLLMs remains constrained by the lack of datasets that fully exploit the synergistic potential between these two core abilities. Existing datasets typically address understanding and generation in isolation, thereby limiting the performance of unified VLLMs. To bridge this critical gap, we introduce a novel dataset construction framework, UnifiedVisual, and present UnifiedVisual-240K, a high-quality dataset meticulously designed to facilitate mutual enhancement between multimodal understanding and generation. UnifiedVisual-240K seamlessly integrates diverse visual and textual inputs and outputs, enabling comprehensive cross-modal reasoning and precise text-to-image alignment. Our dataset encompasses a wide spectrum of tasks and data sources, ensuring rich diversity and addressing key shortcomings of prior resources. Extensive experiments demonstrate that models trained on UnifiedVisual-240K consistently achieve strong performance across a wide range of tasks. Notably, these models exhibit significant mutual reinforcement between multimodal understanding and generation, further validating the effectiveness of our framework and dataset. We believe UnifiedVisual represents a new growth point for advancing unified VLLMs and unlocking their full potential. Our code and datasets is available at https://github.com/fnlp-vision/UnifiedVisual.
Chinese: UnifiedVisual-240K数据集通过整合多样化的多模态任务,弥补了统一视觉语言模型缺乏协同数据集的不足,有效促进了理解与生成能力的相互增强,显著提升了模型在各种应用中的性能。
English: The UnifiedVisual-240K dataset addresses the lack of synergistic datasets for unified vision-language models by integrating diverse multimodal tasks to mutually enhance understanding and generation, significantly boosting model performance across various applications.

Authors:Chenkun Tan, Pengyu Wang, Shaojun Zhou, Botian Jiang, Zhaowei Li, Dong Zhang, Xinghao Wang, Yaqian Zhou, Xipeng Qiu
Title: Decoupled Proxy Alignment: Mitigating Language Prior Conflict for Multimodal Alignment in MLLM
Abstract:
Multimodal large language models (MLLMs) have gained significant attention due to their impressive ability to integrate vision and language modalities. Recent advancements in MLLMs have primarily focused on improving performance through high-quality datasets, novel architectures, and optimized training strategies. However, in this paper, we identify a previously overlooked issue, language prior conflict, a mismatch between the inherent language priors of large language models (LLMs) and the language priors in training datasets. This conflict leads to suboptimal vision-language alignment, as MLLMs are prone to adapting to the language style of training samples. To address this issue, we propose a novel training method called Decoupled Proxy Alignment (DPA). DPA introduces two key innovations: (1) the use of a proxy LLM during pretraining to decouple the vision-language alignment process from language prior interference, and (2) dynamic loss adjustment based on visual relevance to strengthen optimization signals for visually relevant tokens. Extensive experiments demonstrate that DPA significantly mitigates the language prior conflict, achieving superior alignment performance across diverse datasets, model families, and scales. Our method not only improves the effectiveness of MLLM training but also shows exceptional generalization capabilities, making it a robust approach for vision-language alignment. Our code is available at https://github.com/fnlp-vision/DPA.
中文: 本文提出解耦代理对齐(DPA)方法,通过预训练阶段引入代理大语言模型和动态损失调整,有效缓解多模态大语言模型中的语言先验冲突问题,显著提升了视觉-语言对齐效果并在多种数据集上展现出优越的泛化能力。
English: This paper introduces Decoupled Proxy Alignment (DPA), a novel training method that mitigates language prior conflict in multimodal large language models by using a proxy LLM during pretraining and dynamic loss adjustment, leading to improved vision-language alignment and generalization across diverse datasets.

Authors:Kazuma Nagata, Naoshi Kaneko
Title: DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
Abstract:
Automatic colorization of line drawings has been widely studied to reduce the labor cost of hand-drawn anime production. Deep learning approaches, including image/video generation and feature-based correspondence, have improved accuracy but struggle with occlusions, pose variations, and viewpoint changes. To address these challenges, we propose DACoN, a framework that leverages foundation models to capture part-level semantics, even in line drawings. Our method fuses low-resolution semantic features from foundation models with high-resolution spatial features from CNNs for fine-grained yet robust feature extraction. In contrast to previous methods that rely on the Multiplex Transformer and support only one or two reference images, DACoN removes this constraint, allowing any number of references. Quantitative and qualitative evaluations demonstrate the benefits of using multiple reference images, achieving superior colorization performance. Our code and model are available at https://github.com/kzmngt/DACoN.
中文: DACoN是一种创新框架,通过融合基础模型的部分层级语义与CNN空间特征,解决了自动动漫线稿上色中的遮挡和视角变化难题,并支持多参考图像以实现更优着色效果。
English: DACoN is a novel framework that overcomes limitations in automatic anime line drawing colorization by integrating foundation models for part-level semantics with CNN spatial features, enabling the use of multiple reference images for superior performance.

Authors:Kazuma Nagata, Naoshi Kaneko
Title: DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
Abstract:
Automatic colorization of line drawings has been widely studied to reduce the labor cost of hand-drawn anime production. Deep learning approaches, including image/video generation and feature-based correspondence, have improved accuracy but struggle with occlusions, pose variations, and viewpoint changes. To address these challenges, we propose DACoN, a framework that leverages foundation models to capture part-level semantics, even in line drawings. Our method fuses low-resolution semantic features from foundation models with high-resolution spatial features from CNNs for fine-grained yet robust feature extraction. In contrast to previous methods that rely on the Multiplex Transformer and support only one or two reference images, DACoN removes this constraint, allowing any number of references. Quantitative and qualitative evaluations demonstrate the benefits of using multiple reference images, achieving superior colorization performance. Our code and model are available at https://github.com/kzmngt/DACoN.
中文: DACoN是一种创新框架,通过融合基础模型的部分层级语义与CNN空间特征,解决了自动动漫线稿上色中的遮挡和视角变化难题,并支持多参考图像以实现更优着色效果。
English: DACoN is a novel framework that overcomes limitations in automatic anime line drawing colorization by integrating foundation models for part-level semantics with CNN spatial features, enabling the use of multiple reference images for superior performance.

Authors:Siyu Yan, Long Zeng, Xuecheng Wu, Chengcheng Han, Kongcheng Zhang, Chong Peng, Xuezhi Cao, Xunliang Cai, Chenjuan Guo
Title: MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
Abstract:
As large language models~(LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at \href{https://github.com/yansiyu02/MUSE}{https://github.com/yansiyu02/MUSE}.
Chinese: MUSE 是一个全面应对大型语言模型多轮越狱的框架,通过 MUSE-A 利用框架语义和树搜索进行攻击,以及 MUSE-D 通过早期对话干预进行防御,实验证明其能有效识别和减轻漏洞。
English: MUSE is a comprehensive framework addressing multi-turn jailbreaks in large language models by introducing MUSE-A for attacks using frame semantics and tree search, and MUSE-D for defense through early dialogue intervention, effectively identifying and mitigating vulnerabilities as demonstrated in experiments.

Authors:Weihan Peng, Yuling Shi, Yuhang Wang, Xinyun Zhang, Beijun Shen, Xiaodong Gu
Title: SWE-QA: Can Language Models Answer Repository-level Code Questions?
Abstract:
Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.
中文: 本文提出SWE-QA这一仓库级代码问答基准,通过涵盖跨文件推理、多跳依赖分析等类别的576个高质量问答对,突破了现有基准局限于小规模代码片段的不足,并开发了基于大语言模型的智能体框架来应对真实软件环境中的复杂推理挑战。
English: This paper introduces SWE-QA, a repository-level code question answering benchmark designed to address the limitations of existing benchmarks by capturing real-world software complexity through 576 diverse question-answer pairs spanning multiple reasoning categories, and proposes a prototype agentic framework for automated QA evaluation.

Authors:Anzhe Chen, Yifei Yang, Zhenjie Zhu, Kechun Xu, Zhongxiang Zhou, Rong Xiong, Yue Wang
Title: Toward Embodiment Equivariant Vision-Language-Action Policy
Abstract:
Vision-language-action policies learn manipulation skills across tasks, environments and embodiments through large-scale pre-training. However, their ability to generalize to novel robot configurations remains limited. Most approaches emphasize model size, dataset scale and diversity while paying less attention to the design of action spaces. This leads to the configuration generalization problem, which requires costly adaptation. We address this challenge by formulating cross-embodiment pre-training as designing policies equivariant to embodiment configuration transformations. Building on this principle, we propose a framework that (i) establishes a embodiment equivariance theory for action space and policy design, (ii) introduces an action decoder that enforces configuration equivariance, and (iii) incorporates a geometry-aware network architecture to enhance embodiment-agnostic spatial reasoning. Extensive experiments in both simulation and real-world settings demonstrate that our approach improves pre-training effectiveness and enables efficient fine-tuning on novel robot embodiments. Our code is available at https://github.com/hhcaz/e2vla
中文: 本文针对视觉-语言-动作策略在新机器人配置上泛化能力有限的问题,提出了一个通过理论原则、等变动作解码器和几何感知架构来强制实现本体等变性的框架,从而提升预训练效果并支持对新机器人本体的高效微调。
English: This paper addresses the limited generalization of vision-language-action policies to new robot configurations by proposing a framework that enforces embodiment equivariance through theoretical principles, an equivariant action decoder, and a geometry-aware architecture, improving pre-training effectiveness and enabling efficient fine-tuning.

Authors:Taesoo Kim, Yongsik Jo, Hyunmin Song, Taehwan Kim
Title: Towards Human-like Multimodal Conversational Agent by Generating Engaging Speech
Abstract:
Human conversation involves language, speech, and visual cues, with each medium providing complementary information. For instance, speech conveys a vibe or tone not fully captured by text alone. While multimodal LLMs focus on generating text responses from diverse inputs, less attention has been paid to generating natural and engaging speech. We propose a human-like agent that generates speech responses based on conversation mood and responsive style information. To achieve this, we build a novel MultiSensory Conversation dataset focused on speech to enable agents to generate natural speech. We then propose a multimodal LLM-based model for generating text responses and voice descriptions, which are used to generate speech covering paralinguistic information. Experimental results demonstrate the effectiveness of utilizing both visual and audio modalities in conversation to generate engaging speech. The source code is available in https://github.com/kimtaesu24/MSenC
Chinese Summary: 本研究提出了一种拟人化对话代理,通过整合视觉和音频线索,利用基于新型多模态大语言模型的系统,在专门构建的多感官对话数据集上训练,实现了自然且富有吸引力的语音生成。
English Summary: This research introduces a human-like conversational agent that generates natural and engaging speech by integrating visual and audio cues, using a novel multimodal LLM-based model trained on a specialized MultiSensory Conversation dataset.

Authors:Hanlong Wan, Xing Lu, Yan Chen, Karthik Devaprasad, Laura Hinkle
Title: Automating Modelica Module Generation Using Large Language Models: A Case Study on Building Control Description Language
Abstract:
Dynamic energy systems and controls require advanced modeling frameworks to design and test supervisory and fault tolerant strategies. Modelica is a widely used equation based language, but developing control modules is labor intensive and requires specialized expertise. This paper examines the use of large language models (LLMs) to automate the generation of Control Description Language modules in the Building Modelica Library as a case study. We developed a structured workflow that combines standardized prompt scaffolds, library aware grounding, automated compilation with OpenModelica, and human in the loop evaluation. Experiments were carried out on four basic logic tasks (And, Or, Not, and Switch) and five control modules (chiller enable/disable, bypass valve control, cooling tower fan speed, plant requests, and relief damper control). The results showed that GPT 4o failed to produce executable Modelica code in zero shot mode, while Claude Sonnet 4 achieved up to full success for basic logic blocks with carefully engineered prompts. For control modules, success rates reached 83 percent, and failed outputs required medium level human repair (estimated one to eight hours). Retrieval augmented generation often produced mismatches in module selection (for example, And retrieved as Or), while a deterministic hard rule search strategy avoided these errors. Human evaluation also outperformed AI evaluation, since current LLMs cannot assess simulation results or validate behavioral correctness. Despite these limitations, the LLM assisted workflow reduced the average development time from 10 to 20 hours down to 4 to 6 hours per module, corresponding to 40 to 60 percent time savings. These results highlight both the potential and current limitations of LLM assisted Modelica generation, and point to future research in pre simulation validation, stronger grounding, and closed loop evaluation.
中文: 本研究证明大型语言模型可自动化生成Modelica控制模块,成功率最高达83%并缩短40-60%开发时间,但当前仍需人工干预进行代码验证与错误修正。
English: This study demonstrates that large language models can automate the generation of Modelica control modules, achieving up to 83% success rates and reducing development time by 40-60%, though current limitations require human intervention for code validation and error correction.

Authors:Feng Ding, Haisheng Fu, Soroush Oraki, Jie Liang
Title: LSTC-MDA: A Unified Framework for Long-Short Term Temporal Convolution and Mixed Data Augmentation in Skeleton-Based Action Recognition
Abstract:
Skeleton-based action recognition faces two longstanding challenges: the scarcity of labeled training samples and difficulty modeling short- and long-range temporal dependencies. To address these issues, we propose a unified framework, LSTC-MDA, which simultaneously improves temporal modeling and data diversity. We introduce a novel Long-Short Term Temporal Convolution (LSTC) module with parallel short- and long-term branches, these two feature branches are then aligned and fused adaptively using learned similarity weights to preserve critical long-range cues lost by conventional stride-2 temporal convolutions. We also extend Joint Mixing Data Augmentation (JMDA) with an Additive Mixup at the input level, diversifying training samples and restricting mixup operations to the same camera view to avoid distribution shifts. Ablation studies confirm each component contributes. LSTC-MDA achieves state-of-the-art results: 94.1% and 97.5% on NTU 60 (X-Sub and X-View), 90.4% and 92.0% on NTU 120 (X-Sub and X-Set),97.2% on NW-UCLA. Code: https://github.com/xiaobaoxia/LSTC-MDA.
中文: 提出的LSTC-MDA框架通过长短时态卷积模块捕捉多尺度时间依赖性和扩展的联合混合数据增强增加样本多样性,显著提升了基于骨架的动作识别性能,在多个基准测试中取得了最优结果。
English: The proposed LSTC-MDA framework enhances skeleton-based action recognition by introducing a Long-Short Term Temporal Convolution module to capture multi-scale temporal dependencies and an extended Joint Mixing Data Augmentation to increase sample diversity, achieving state-of-the-art results on multiple benchmarks.

Authors:Jianglan Wei, Zhenyu Zhang, Pengcheng Wang, Mingjie Zeng, Zhigang Zeng
Title: HDC-X: Efficient Medical Data Classification for Embedded Devices
Abstract:
Energy-efficient medical data classification is essential for modern disease screening, particularly in home and field healthcare where embedded devices are prevalent. While deep learning models achieve state-of-the-art accuracy, their substantial energy consumption and reliance on GPUs limit deployment on such platforms. We present HDC-X, a lightweight classification framework designed for low-power devices. HDC-X encodes data into high-dimensional hypervectors, aggregates them into multiple cluster-specific prototypes, and performs classification through similarity search in hyperspace. We evaluate HDC-X across three medical classification tasks; on heart sound classification, HDC-X is $350\times$ more energy-efficient than Bayesian ResNet with less than 1% accuracy difference. Moreover, HDC-X demonstrates exceptional robustness to noise, limited training data, and hardware error, supported by both theoretical analysis and empirical results, highlighting its potential for reliable deployment in real-world settings. Code is available at https://github.com/jianglanwei/HDC-X.
中文: HDC-X是一种高能效的轻量级分类框架,通过超维计算处理医疗数据,在低功耗设备上实现接近最优的精度,并具备出色的鲁棒性。
English: HDC-X is a highly energy-efficient and lightweight classification framework that uses hyperdimensional computing for medical data, achieving near-state-of-the-art accuracy with exceptional robustness on low-power devices.

Authors:Xinyue Wu, Zixuan Li, Fan Hu, Ting Lin, Xiaotian Zhao, Runxi Wang, Xinfei Guo
Title: Shift-Left Techniques in Electronic Design Automation: A Survey
Abstract:
The chip design process involves numerous steps, beginning with defining product requirements and progressing through architectural planning, system-level design, and the physical layout of individual circuit blocks. As the enablers of large-scale chip development, Electronic Design Automation (EDA) tools play a vital role in helping designers achieve high-quality results. The Shift-Left methodology introduces a pathway toward creating digital twins and fusing multiple design steps, thereby transitioning traditionally sequential, physically-aware processes into virtual design environments. This shift allows designers to establish stronger correlations earlier and optimize designs more effectively. However, challenges remain, especially in accurately replicating downstream behaviors and determining the right scope and timing for adoption. These challenges, in turn, have revealed new opportunities for EDA vendors, physical designers, and logic designers alike. As the industry advances toward intelligent EDA tools and techniques, it is timely to reflect on Shift-Left progress made and the challenges that remain. The rise of AI techniques and the momentum of open-source design flows have significantly strengthened prediction and modeling capabilities, making data-driven methods increasingly relevant to the EDA community. This, in turn, enhances the ''Shift-Left'' features embedded in current tools. In this paper, we present a comprehensive survey of existing and emerging paradigms in Shift-Left research within EDA and the broader design ecosystem. Our goal is to provide a unique perspective on the state of the field and its future directions. Relevant papers mentioned are organized in https://github.com/iCAS-SJTU/Shift-Left-EDA-Papers.
中文: EDA中的Shift-Left方法通过虚拟环境融合设计步骤,利用AI和开源工具实现早期优化,但准确模拟下游行为等挑战仍待解决。
English: The Shift-Left methodology in Electronic Design Automation (EDA) integrates design steps into virtual environments, enabling earlier optimization and enhanced data-driven capabilities through AI and open-source tools, though challenges in replicating downstream behaviors persist.

Authors:Yin Chen, Jia Li, Jinpeng Hu, Zhenzhen Hu, Richang Hong
Title: CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition
Abstract:
Audiovisual emotion recognition (AVER) in the wild is still hindered by pose variation, occlusion, and background noise. Prevailing methods primarily rely on large-scale domain-specific pre-training, which is costly and often mismatched to real-world affective data. To address this, we present CLAIP-Emo, a modular framework that reframes in-the-wild AVER as a parameter-efficient adaptation of language-supervised foundation models (CLIP/CLAP). Specifically, it (i) preserves language-supervised priors by freezing CLIP/CLAP backbones and performing emotion-oriented adaptation via LoRA (updating \ensuremath{\le}4.0\% of the total parameters), (ii) allocates temporal modeling asymmetrically, employing a lightweight Transformer for visual dynamics while applying mean pooling for audio prosody, and (iii) applies a simple fusion head for prediction. On DFEW and MAFW, CLAIP-Emo (ViT-L/14) achieves 80.14\% and 61.18\% weighted average recall with only 8M training parameters, setting a new state of the art. Our findings suggest that parameter-efficient adaptation of language-supervised foundation models provides a scalable alternative to domain-specific pre-training for real-world AVER. The code and models will be available at \href{https://github.com/MSA-LMC/CLAIP-Emo}{https://github.com/MSA-LMC/CLAIP-Emo}.
Chinese: CLAIP-Emo框架通过高效参数调整语言监督模型(如CLIP/CLAP)实现野外视听情感识别,在基准测试中以极少的参数更新量创造了最新性能记录。
English: The CLAIP-Emo framework introduces a parameter-efficient adaptation of language-supervised models like CLIP/CLAP for audiovisual emotion recognition, achieving state-of-the-art results on benchmarks with minimal parameter updates.

Authors:Xinran Zheng, Xingzhi Qian, Yiling He, Shuo Yang, Lorenzo Cavallaro
Title: Beyond Classification: Evaluating LLMs for Fine-Grained Automatic Malware Behavior Auditing
Abstract:
Automated malware classification has achieved strong detection performance. Yet, malware behavior auditing seeks causal and verifiable explanations of malicious activities -- essential not only to reveal what malware does but also to substantiate such claims with evidence. This task is challenging, as adversarial intent is often hidden within complex, framework-heavy applications, making manual auditing slow and costly. Large Language Models (LLMs) could help address this gap, but their auditing potential remains largely unexplored due to three limitations: (1) scarce fine-grained annotations for fair assessment; (2) abundant benign code obscuring malicious signals; and (3) unverifiable, hallucination-prone outputs undermining attribution credibility. To close this gap, we introduce MalEval, a comprehensive framework for fine-grained Android malware auditing, designed to evaluate how effectively LLMs support auditing under real-world constraints. MalEval provides expert-verified reports and an updated sensitive API list to mitigate ground truth scarcity and reduce noise via static reachability analysis. Function-level structural representations serve as intermediate attribution units for verifiable evaluation. Building on this, we define four analyst-aligned tasks -- function prioritization, evidence attribution, behavior synthesis, and sample discrimination -- together with domain-specific metrics and a unified workload-oriented score. We evaluate seven widely used LLMs on a curated dataset of recent malware and misclassified benign apps, offering the first systematic assessment of their auditing capabilities. MalEval reveals both promising potential and critical limitations across audit stages, providing a reproducible benchmark and foundation for future research on LLM-enhanced malware behavior auditing. MalEval is publicly available at https://github.com/ZhengXR930/MalEval.git
中文: MalEval框架通过提供专家验证数据和结构化表示,系统评估大语言模型在四项分析任务中的恶意软件审计能力,既揭示了其潜力也暴露了关键局限,为未来研究建立了可复现基准。
English: The MalEval framework addresses the limitations of large language models in malware auditing by providing expert-verified data and structural representations to systematically evaluate their capabilities across four analyst-aligned tasks, revealing both potential and critical gaps in current approaches.

Authors:Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal
Title: The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration
Abstract:
As large language models (LLMs) become integral to multi-agent systems, new privacy risks emerge that extend beyond memorization, direct inference, or single-turn evaluations. In particular, seemingly innocuous responses, when composed across interactions, can cumulatively enable adversaries to recover sensitive information, a phenomenon we term compositional privacy leakage. We present the first systematic study of such compositional privacy leaks and possible mitigation methods in multi-agent LLM systems. First, we develop a framework that models how auxiliary knowledge and agent interactions jointly amplify privacy risks, even when each response is benign in isolation. Next, to mitigate this, we propose and evaluate two defense strategies: (1) Theory-of-Mind defense (ToM), where defender agents infer a questioner's intent by anticipating how their outputs may be exploited by adversaries, and (2) Collaborative Consensus Defense (CoDef), where responder agents collaborate with peers who vote based on a shared aggregated state to restrict sensitive information spread. Crucially, we balance our evaluation across compositions that expose sensitive information and compositions that yield benign inferences. Our experiments quantify how these defense strategies differ in balancing the privacy-utility trade-off. We find that while chain-of-thought alone offers limited protection to leakage (~39% sensitive blocking rate), our ToM defense substantially improves sensitive query blocking (up to 97%) but can reduce benign task success. CoDef achieves the best balance, yielding the highest Balanced Outcome (79.8%), highlighting the benefit of combining explicit reasoning with defender collaboration. Together, our results expose a new class of risks in collaborative LLM deployments and provide actionable insights for designing safeguards against compositional, context-driven privacy leakage.
中文摘要:本研究揭示了多智能体大语言模型系统中组合式隐私泄露的风险,即看似无害的交互响应在累积中可能泄露敏感信息,并提出心智理论和协作共识两种防御策略,在保护隐私与保持系统效用间实现了最佳平衡。
English Summary: This study identifies compositional privacy leakage in multi-agent LLM systems, where seemingly harmless responses collectively expose sensitive information, and proposes two defense strategies—Theory-of-Mind and Collaborative Consensus—that effectively balance privacy protection with utility.

Authors:Kazumi Kasaura, Naoto Onda, Yuta Oriike, Masaya Taniguchi, Akiyoshi Sannai, Sho Sonoda
Title: Discovering New Theorems via LLMs with In-Context Proof Learning in Lean
Abstract:
Large Language Models have demonstrated significant promise in formal theorem proving. However, previous works mainly focus on solving existing problems. In this paper, we focus on the ability of LLMs to find novel theorems. We propose Conjecturing-Proving Loop pipeline for automatically generating mathematical conjectures and proving them in Lean 4 format. A feature of our approach is that we generate and prove further conjectures with context including previously generated theorems and their proofs, which enables the generation of more difficult proofs by in-context learning of proof strategies without changing parameters of LLMs. We demonstrated that our framework rediscovered theorems with verification, which were published in past mathematical papers and have not yet formalized. Moreover, at least one of these theorems could not be proved by the LLM without in-context learning, even in natural language, which means that in-context learning was effective for neural theorem proving. The source code is available at https://github.com/auto-res/ConjecturingProvingLoop.
中文: 本文提出了一种猜想-证明循环框架,使大语言模型能够基于先前生成的定理和证明进行上下文学习,在Lean 4中自主生成并验证新颖数学定理,无需调整参数即可解决更复杂的证明问题。
English: This paper introduces a Conjecturing-Proving Loop pipeline that enables large language models to autonomously generate and prove novel mathematical theorems in Lean 4, leveraging in-context learning with prior theorems and proofs to tackle increasingly complex problems without altering model parameters.

Authors:Ivan Ternovtsii
Title: Opening the Black Box: Interpretable LLMs via Semantic Resonance Architecture
Abstract:
Large language models (LLMs) achieve remarkable performance but remain difficult to interpret. Mixture-of-Experts (MoE) models improve efficiency through sparse activation, yet typically rely on opaque, learned gating functions. While similarity-based routing (Cosine Routers) has been explored for training stabilization, its potential for inherent interpretability remains largely untapped. We introduce the Semantic Resonance Architecture (SRA), an MoE approach designed to ensure that routing decisions are inherently interpretable. SRA replaces learned gating with a Chamber of Semantic Resonance (CSR) module, which routes tokens based on cosine similarity with trainable semantic anchors. We also introduce a novel Dispersion Loss that encourages orthogonality among anchors to enforce diverse specialization. Experiments on WikiText-103 demonstrate that SRA achieves a validation perplexity of 13.41, outperforming both a dense baseline (14.13) and a Standard MoE baseline (13.53) under matched active parameter constraints (29.0M). Crucially, SRA exhibits superior expert utilization (1.0% dead experts vs. 14.8% in the Standard MoE) and develops distinct, semantically coherent specialization patterns, unlike the noisy specialization observed in standard MoEs. This work establishes semantic routing as a robust methodology for building more transparent and controllable language models.
中文: 语义共振架构提出了一种本质可解释的专家混合模型,通过基于余弦相似度的路由机制替代学习门控,在提升模型性能与专家专业化的同时增强了可解释性。
English: The Semantic Resonance Architecture (SRA) introduces an inherently interpretable mixture-of-experts model that replaces learned gating with cosine similarity-based routing to trainable semantic anchors, achieving superior performance and expert specialization while enhancing transparency.

Authors:Hai Huang, Yann LeCun, Randall Balestriero
Title: LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
Abstract:
Large Language Model (LLM) pretraining, finetuning, and evaluation rely on input-space reconstruction and generative capabilities. Yet, it has been observed in vision that embedding-space training objectives, e.g., with Joint Embedding Predictive Architectures (JEPAs), are far superior to their input-space counterpart. That mismatch in how training is achieved between language and vision opens up a natural question: {\em can language training methods learn a few tricks from the vision ones?} The lack of JEPA-style LLM is a testimony of the challenge in designing such objectives for language. In this work, we propose a first step in that direction where we develop LLM-JEPA, a JEPA based solution for LLMs applicable both to finetuning and pretraining. Thus far, LLM-JEPA is able to outperform the standard LLM training objectives by a significant margin across models, all while being robust to overfiting. Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider, RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2 and Olmo families. Code: https://github.com/rbalestr-lab/llm-jepa.
中文: 该摘要提出LLM-JEPA这一新型联合嵌入预测架构,在多种数据集和模型系列中显著优于标准语言模型训练方法,同时在预训练和微调阶段均展现出优异的抗过拟合能力。
English: This abstract introduces LLM-JEPA, a novel Joint Embedding Predictive Architecture for language models that significantly outperforms standard training methods in both pretraining and finetuning across multiple datasets and model families while demonstrating strong resistance to overfitting.

Authors:Hai Huang, Yann LeCun, Randall Balestriero
Title: LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
Abstract:
Large Language Model (LLM) pretraining, finetuning, and evaluation rely on input-space reconstruction and generative capabilities. Yet, it has been observed in vision that embedding-space training objectives, e.g., with Joint Embedding Predictive Architectures (JEPAs), are far superior to their input-space counterpart. That mismatch in how training is achieved between language and vision opens up a natural question: {\em can language training methods learn a few tricks from the vision ones?} The lack of JEPA-style LLM is a testimony of the challenge in designing such objectives for language. In this work, we propose a first step in that direction where we develop LLM-JEPA, a JEPA based solution for LLMs applicable both to finetuning and pretraining. Thus far, LLM-JEPA is able to outperform the standard LLM training objectives by a significant margin across models, all while being robust to overfiting. Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider, RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2 and Olmo families. Code: https://github.com/rbalestr-lab/llm-jepa.
中文: 该摘要提出LLM-JEPA这一新型联合嵌入预测架构,在多种数据集和模型系列中显著优于标准语言模型训练方法,同时在预训练和微调阶段均展现出优异的抗过拟合能力。
English: This abstract introduces LLM-JEPA, a novel Joint Embedding Predictive Architecture for language models that significantly outperforms standard training methods in both pretraining and finetuning across multiple datasets and model families while demonstrating strong resistance to overfitting.

Authors:Happymore Masoka
Title: Advancing Conversational AI with Shona Slang: A Dataset and Hybrid Model for Digital Inclusion
Abstract:
African languages remain underrepresented in natural language processing (NLP), with most corpora limited to formal registers that fail to capture the vibrancy of everyday communication. This work addresses this gap for Shona, a Bantu language spoken in Zimbabwe and Zambia, by introducing a novel Shona--English slang dataset curated from anonymized social media conversations. The dataset is annotated for intent, sentiment, dialogue acts, code-mixing, and tone, and is publicly available at https://github.com/HappymoreMasoka/Working_with_shona-slang. We fine-tuned a multilingual DistilBERT classifier for intent recognition, achieving 96.4\% accuracy and 96.3\% F1-score, hosted at https://huggingface.co/HappymoreMasoka. This classifier is integrated into a hybrid chatbot that combines rule-based responses with retrieval-augmented generation (RAG) to handle domain-specific queries, demonstrated through a use case assisting prospective students with graduate program information at Pace University. Qualitative evaluation shows the hybrid system outperforms a RAG-only baseline in cultural relevance and user engagement. By releasing the dataset, model, and methodology, this work advances NLP resources for African languages, promoting inclusive and culturally resonant conversational AI.
中文: 本研究发布了首个基于社交媒体对话的绍纳语-英语俚语数据集,并开发了结合规则与检索增强生成的混合聊天机器人,在文化相关性和用户参与度上表现优异,推动了非洲语言自然语言处理资源的包容性发展。
English: This study introduces a publicly available Shona-English slang dataset from social media, annotated for various linguistic features, and presents a high-accuracy hybrid chatbot that enhances cultural relevance in conversational AI for underrepresented African languages.

Authors:Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
Title: GenExam: A Multidisciplinary Text-to-Image Exam
Abstract:
Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to general AGI. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.
中文摘要:GenExam是首个多学科文本到图像的考试基准,通过考试式提示严格评估AI模型的理解、推理和生成综合能力,实验表明即使最先进模型也面临巨大挑战,为通用人工智能发展提供重要参考。
English Summary: GenExam is a pioneering multidisciplinary text-to-image benchmark that rigorously evaluates AI models' integrated understanding, reasoning, and generation capabilities through exam-style prompts, revealing significant performance gaps even in state-of-the-art models.

Authors:Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He
Title: NIRVANA: Structured pruning reimagined for large language models compression
Abstract:
Structured pruning of large language models (LLMs) offers substantial efficiency improvements by removing entire hidden units, yet current approaches often suffer from significant performance degradation, particularly in zero-shot settings, and necessitate costly recovery techniques such as supervised fine-tuning (SFT) or adapter insertion. To address these critical shortcomings, we introduce NIRVANA, a novel pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. Leveraging a first-order saliency criterion derived from the Neural Tangent Kernel under Adam optimization dynamics, NIRVANA provides a theoretically grounded pruning strategy that respects essential model training behaviors. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules (attention vs. MLP), which adjusts pruning intensity between modules in a globally balanced manner. Additionally, to mitigate the high sensitivity of pruning decisions to calibration data quality, we propose a simple yet effective KL divergence-based calibration data selection strategy, ensuring more reliable and task-agnostic pruning outcomes. Comprehensive experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints, providing a theoretically sound and practical approach to LLM compression. The code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/NIRVANA.
中文: NIRVANA是一种新颖的大语言模型结构化剪枝方法,通过理论驱动的显著性标准和自适应稀疏分配机制,在保持零样本准确性的同时实现鲁棒的微调能力。
English: NIRVANA is a novel structured pruning method for large language models that preserves zero-shot accuracy while enabling robust fine-tuning through theoretically grounded saliency criteria and adaptive sparsity allocation.

Authors:Yifan Hu, Jie Yang, Tian Zhou, Peiyuan Liu, Yujin Tang, Rong Jin, Liang Sun
Title: Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting
Abstract:
Although contrastive and other representation-learning methods have long been explored in vision and NLP, their adoption in modern time series forecasters remains limited. We believe they hold strong promise for this domain. To unlock this potential, we explicitly align past and future representations, thereby bridging the distributional gap between input histories and future targets. To this end, we introduce TimeAlign, a lightweight, plug-and-play framework that establishes a new representation paradigm, distinct from contrastive learning, by aligning auxiliary features via a simple reconstruction task and feeding them back into any base forecaster. Extensive experiments across eight benchmarks verify its superior performance. Further studies indicate that the gains arise primarily from correcting frequency mismatches between historical inputs and future outputs. Additionally, we provide two theoretical justifications for how reconstruction improves forecasting generalization and how alignment increases the mutual information between learned representations and predicted targets. The code is available at https://github.com/TROUBADOUR000/TimeAlign.
中文摘要:本文提出TimeAlign框架,通过重构任务对齐时间序列的过去与未来表示,弥合分布差异,在多个基准测试中显著提升了预测性能。
English Summary: The paper introduces TimeAlign, a lightweight framework that aligns past and future time series representations through reconstruction to bridge distribution gaps and improve forecasting performance across multiple benchmarks.

Authors:Peng Xu, Shengwu Xiong, Jiajun Zhang, Yaxiong Chen, Bowen Zhou, Chen Change Loy, David A. Clifton, Kyoung Mu Lee, Luc Van Gool, Ruiming He, Ruilin Yao, Xinwei Long, Jirui Huang, Kai Tian, Sa Yang, Yihua Shao, Jin Feng, Yue Zhong, Jiakai Zhou, Cheng Tang, Tianyu Zou, Yifang Zhang, Junming Liang, Guoyou Li, Zhaoxiang Wang, Qiang Zhou, Yichen Zhao, Shili Xiong, Hyeongjin Nam, Jaerin Lee, Jaeyoung Chung, JoonKyu Park, Junghun Oh, Kanggeon Lee, Wooseok Lee, Juneyoung Ro, Turghun Osman, Can Hu, Chaoyang Liao, Cheng Chen, Chengcheng Han, Chenhao Qiu, Chong Peng, Cong Xu, Dailin Li, Feiyu Wang, Feng Gao, Guibo Zhu, Guopeng Tang, Haibo Lu, Han Fang, Han Qi, Hanxiao Wu, Haobo Cheng, Hongbo Sun, Hongyao Chen, Huayong Hu, Hui Li, Jiaheng Ma, Jiang Yu, Jianing Wang, Jie Yang, Jing He, Jinglin Zhou, Jingxuan Li, Josef Kittler, Lihao Zheng, Linnan Zhao, Mengxi Jia, Muyang Yan, Nguyen Thanh Thien, Pu Luo, Qi Li, Shien Song, Shijie Dong, Shuai Shao, Shutao Li, Taofeng Xue, Tianyang Xu, Tianyi Gao, Tingting Li, Wei Zhang, Weiyang Su, Xiaodong Dong, Xiao-Jun Wu, Xiaopeng Zhou, Xin Chen, Xin Wei, Xinyi You, Xudong Kang, Xujie Zhou, Xusheng Liu, Yanan Wang, Yanbin Huang, Yang Liu, Yang Yang, Yanglin Deng, Yashu Kang, Ye Yuan, Yi Wen, Yicen Tian, Yilin Tao, Yin Tang, Yipeng Lin, Yiqing Wang, Yiting Xi, Yongkang Yu, Yumei Li, Yuxin Qin, Yuying Chen, Yuzhe Cen, Zhaofan Zou, Zhaohong Liu, Zhehao Shen, Zhenglin Du, Zhengyang Li, Zhenni Huang, Zhenwei Shao, Zhilong Song, Zhiyong Feng, Zhiyu Wang, Zhou Yu, Ziang Li, Zihan Zhai, Zijian Zhang, Ziyang Peng, Ziyun Xiao, Zongshu Li
Title: MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Abstract:
This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year's MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants' methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page https://github.com/mars2workshop/, where our updates and announcements of upcoming events will be continuously provided.
中文:MARS2 2025挑战赛通过发布Lens和AdsQA两个定制数据集,在三大竞赛赛道中评估多模态模型在现实场景与专业领域的推理能力,已测评40余个基线模型并公开所有资源。
English: The MARS2 2025 Challenge introduces two specialized datasets, Lens and AdsQA, to evaluate multimodal reasoning in real-world and domain-specific scenarios through three competition tracks, with over 40 baselines assessed and all resources made publicly available.

Authors:Jingyi Yuan, Jianxiong Ye, Wenkang Chen, Chenqiang Gao
Title: AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration
Abstract:
Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task. Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods.The code will be available at https://github.com/Kaisor-Yuan/AD-DINOv3.
中文摘要:本文提出AD-DINOv3,一种新颖的视觉-语言多模态框架,通过轻量级适配器解决特征对齐问题,并设计异常感知校准模块增强异常区域识别能力,在多个工业与医疗基准测试中达到或超越现有最优方法的零样本异常检测性能。
English Summary: This paper introduces AD-DINOv3, a novel vision-language framework that adapts DINOv3 for Zero-Shot Anomaly Detection by addressing feature misalignment through lightweight adapters and enhancing anomaly discrimination via an Anomaly-Aware Calibration Module, achieving state-of-the-art performance across multiple benchmarks.

Authors:Maosheng Qin, Renyu Zhu, Mingxuan Xia, Chenkai Chen, Zhen Zhu, Minmin Lin, Junbo Zhao, Lu Xu, Changjie Fan, Runze Wu, Haobo Wang
Title: CrowdAgent: Multi-Agent Managed Multi-Source Annotation System
Abstract:
High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources-including Large Language Models (LLMs), Small Language Models (SLMs), and human experts-they often focus narrowly on the labeling step itself. A critical gap remains in the holistic process control required to manage these sources dynamically, addressing complex scheduling and quality-cost trade-offs in a unified manner. Inspired by real-world crowdsourcing companies, we introduce CrowdAgent, a multi-agent system that provides end-to-end process control by integrating task assignment, data annotation, and quality/cost management. It implements a novel methodology that rationally assigns tasks, enabling LLMs, SLMs, and human experts to advance synergistically in a collaborative annotation workflow. We demonstrate the effectiveness of CrowdAgent through extensive experiments on six diverse multimodal classification tasks. The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.
中文: CrowdAgent是一个多智能体系统,通过动态整合任务分配、数据标注和质量成本管理,为LLM、SLM和人类专家提供端到端的协同标注流程控制。
English: CrowdAgent is a multi-agent system that provides end-to-end process control for data annotation by dynamically integrating task assignment, annotation, and quality-cost management across LLMs, SLMs, and human experts.

Authors:Sunkyung Lee, Seongmin Park, Jonghyo Kim, Mincheol Yoon, Jongwuk Lee
Title: Enhancing Time Awareness in Generative Recommendation
Abstract:
Generative recommendation has emerged as a promising paradigm that formulates the recommendations into a text-to-text generation task, harnessing the vast knowledge of large language models. However, existing studies focus on considering the sequential order of items and neglect to handle the temporal dynamics across items, which can imply evolving user preferences. To address this limitation, we propose a novel model, Generative Recommender Using Time awareness (GRUT), effectively capturing hidden user preferences via various temporal signals. We first introduce Time-aware Prompting, consisting of two key contexts. The user-level temporal context models personalized temporal patterns across timestamps and time intervals, while the item-level transition context provides transition patterns across users. We also devise Trend-aware Inference, a training-free method that enhances rankings by incorporating trend information about items with generation likelihood. Extensive experiments demonstrate that GRUT outperforms state-of-the-art models, with gains of up to 15.4% and 14.3% in Recall@5 and NDCG@5 across four benchmark datasets. The source code is available at https://github.com/skleee/GRUT.
中文摘要:提出的GRUT模型通过时间感知提示和趋势感知推理,在生成式推荐中有效捕捉时序动态,相比现有方法实现了显著性能提升。
English Summary: The proposed GRUT model enhances generative recommendation by incorporating temporal dynamics through time-aware prompting and trend-aware inference, achieving significant performance improvements over existing methods.

Authors:Harvey Mannering, Zhiwu Huang, Adam Prugel-Bennett
Title: Noise-Level Diffusion Guidance: Well Begun is Half Done
Abstract:
Diffusion models have achieved state-of-the-art image generation. However, the random Gaussian noise used to start the diffusion process influences the final output, causing variations in image quality and prompt adherence. Existing noise-level optimization approaches generally rely on extra dataset construction, additional networks, or backpropagation-based optimization, limiting their practicality. In this paper, we propose Noise Level Guidance (NLG), a simple, efficient, and general noise-level optimization approach that refines initial noise by increasing the likelihood of its alignment with general guidance - requiring no additional training data, auxiliary networks, or backpropagation. The proposed NLG approach provides a unified framework generalizable to both conditional and unconditional diffusion models, accommodating various forms of diffusion-level guidance. Extensive experiments on five standard benchmarks demonstrate that our approach enhances output generation quality and input condition adherence. By seamlessly integrating with existing guidance methods while maintaining computational efficiency, our method establishes NLG as a practical and scalable enhancement to diffusion models. Code can be found at https://github.com/harveymannering/NoiseLevelGuidance.
中文: 本文提出噪声水平引导(NLG)方法,通过优化初始噪声无需额外数据、网络或反向传播,有效提升扩散模型的图像生成质量与提示遵循度,适用于多种引导方式并保持计算效率。
English: This paper introduces Noise Level Guidance (NLG), a simple and efficient method that refines initial noise in diffusion models to improve image quality and prompt adherence without requiring extra data, networks, or backpropagation, demonstrating effectiveness across various benchmarks and guidance methods.

Authors:Mariano Barone, Antonio Romano, Giuseppe Riccio, Marco Postiglione, Vincenzo Moscato
Title: Combating Biomedical Misinformation through Multi-modal Claim Detection and Evidence-based Verification
Abstract:
Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER
中文摘要:CER框架通过整合科学证据检索、大语言模型推理和监督验证预测,有效提升生物医学事实核查的准确性,减少错误信息风险并确保结果基于可验证证据。
English Summary: The CER framework enhances biomedical fact-checking by integrating evidence retrieval, reasoning with large language models, and supervised prediction to reduce misinformation risks while grounding outputs in verifiable scientific sources.

Authors:Zhen Xu, Guorui Lu, Chang Gao, Qinyu Chen
Title: EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View
Abstract:
Hand tracking holds great promise for intuitive interaction paradigms, but frame-based methods often struggle to meet the requirements of accuracy, low latency, and energy efficiency, especially in resource-constrained settings such as Extended Reality (XR) devices. Event cameras provide $μ$s-level temporal resolution at mW-level power by asynchronously sensing brightness changes. In this work, we present EvHand-FPV, a lightweight framework for egocentric First-Person-View 3D hand tracking from a single event camera. We construct an event-based FPV dataset that couples synthetic training data with 3D labels and real event data with 2D labels for evaluation to address the scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based region of interest (ROI) that localizes the hand region via geometric cues, combined with an end-to-end mapping strategy that embeds ROI offsets into the network to reduce computation without explicit reconstruction, and a multi-task learning strategy with an auxiliary geometric feature head that improves representations without test-time overhead. On our real FPV test set, EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters from 11.2M to 1.2M by 89% and FLOPs per inference from 1.648G to 0.185G by 89%. It also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results demonstrate accurate and efficient egocentric event-based hand tracking suitable for on-device XR applications. The dataset and code are available at https://github.com/zen5x5/EvHand-FPV.
中文: EvHand-FPV是一种轻量级框架,利用单个事件相机实现高效三维手部追踪,在显著降低计算需求的同时保持高精度,适用于设备端XR应用。
English: EvHand-FPV is a lightweight framework for efficient 3D hand tracking using a single event camera, achieving high accuracy with significantly reduced computational demands for on-device XR applications.

Authors:Jovana Videnovic, Matej Kristan, Alan Lukezic
Title: Distractor-Aware Memory-Based Visual Object Tracking
Abstract:
Recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these modes are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and introspection-based management method for SAM2, leading to DAM4SAM. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Furthermore, integrating the proposed distractor-aware memory into a real-time tracker EfficientTAM leads to 11% improvement and matches tracking quality of the non-real-time SAM2.1-L on multiple tracking and segmentation benchmarks, while integration with edge-based tracker EdgeTAM delivers 4% performance boost, demonstrating a very good generalization across architectures.
中文: 提出的DAM4SAM模块通过引入干扰物感知记忆系统改进了SAM2,有效减少跟踪漂移并提升目标遮挡后的重检测能力,在多个基准测试中创下最新最优成绩,且与其他跟踪器集成时展现出优秀的泛化性能。
English: The proposed DAM4SAM module enhances SAM2 by incorporating a distractor-aware memory system that mitigates tracking drift and improves redetection after occlusion, achieving state-of-the-art results across multiple benchmarks while demonstrating strong generalization when integrated with other trackers.

Authors:Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang
Title: EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics
Abstract:
Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with meticulously crafted prompt. Ultimately, Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method.Source code is available in: https://github.com/einsteinxia/EDITS.
中文摘要:EDITS框架通过融合视觉语言模型生成的文本语义与图像特征,并利用大型语言模型构建文本原型,有效提升了数据集蒸馏过程中对高级语义信息的保留能力,显著优于传统仅关注低级特征的方法。
English Summary: The EDITS framework enhances dataset distillation by integrating textual semantics from vision-language models and large language models to synthesize compact datasets that preserve high-level semantic information, outperforming traditional methods focused on low-level features.

Authors:Puru Vaish, Felix Meister, Tobias Heimann, Christoph Brune, Jelmer M. Wolterink
Title: Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation
Abstract:
Many recent approaches in representation learning implicitly assume that uncorrelated views of a data point are sufficient to learn meaningful representations for various downstream tasks. In this work, we challenge this assumption and demonstrate that meaningful structure in the latent space does not emerge naturally. Instead, it must be explicitly induced. We propose a method that aligns representations from different views of the data to align complementary information without inducing false positives. Our experiments show that our proposed self-supervised learning method, Consistent View Alignment, improves performance for downstream tasks, highlighting the critical role of structured view alignment in learning effective representations. Our method achieved first and second place in the MICCAI 2025 SSL3D challenge when using a Primus vision transformer and ResEnc convolutional neural network, respectively. The code and pretrained model weights are released at https://github.com/Tenbatsu24/LatentCampus.
中文摘要:本研究挑战了不相关数据视图足以学习有效表征的假设,提出名为“一致视图对齐”的自监督方法,通过显式构建潜在空间结构提升下游任务性能,并在MICCAI 2025挑战赛中取得领先排名。
English Summary: The study challenges the assumption that uncorrelated data views suffice for learning meaningful representations, proposing a self-supervised method called Consistent View Alignment that explicitly structures latent space to improve downstream task performance, as evidenced by top rankings in the MICCAI 2025 challenge.

Authors:Nguyen Lan Vi Vu, Thanh-Huy Nguyen, Thien Nguyen, Daisuke Kihara, Tianyang Wang, Xingjian Li, Min Xu
Title: Semi-MoE: Mixture-of-Experts meets Semi-Supervised Histopathology Segmentation
Abstract:
Semi-supervised learning has been employed to alleviate the need for extensive labeled data for histopathology image segmentation, but existing methods struggle with noisy pseudo-labels due to ambiguous gland boundaries and morphological misclassification. This paper introduces Semi-MOE, to the best of our knowledge, the first multi-task Mixture-of-Experts framework for semi-supervised histopathology image segmentation. Our approach leverages three specialized expert networks: A main segmentation expert, a signed distance field regression expert, and a boundary prediction expert, each dedicated to capturing distinct morphological features. Subsequently, the Multi-Gating Pseudo-labeling module dynamically aggregates expert features, enabling a robust fuse-and-refine pseudo-labeling mechanism. Furthermore, to eliminate manual tuning while dynamically balancing multiple learning objectives, we propose an Adaptive Multi-Objective Loss. Extensive experiments on GlaS and CRAG benchmarks show that our method outperforms state-of-the-art approaches in low-label settings, highlighting the potential of MoE-based architectures in advancing semi-supervised segmentation. Our code is available at https://github.com/vnlvi2k3/Semi-MoE.
中文: 本文提出Semi-MOE框架,首次采用多任务专家混合模型解决组织病理学图像半监督分割中的伪标签噪声问题,通过专业化网络和自适应损失机制在基准测试中实现最优性能。
English: This paper introduces Semi-MOE, a novel multi-task Mixture-of-Experts framework that addresses noisy pseudo-labels in semi-supervised histopathology image segmentation through specialized expert networks and an adaptive loss mechanism, demonstrating superior performance on benchmark datasets.

Authors:Jiayu Yuan, Ming Dai, Enhui Zheng, Chao Su, Nanxing Chen, Qiming Hu, Shibo Zhu, Yibin Cao
Title: SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments
Abstract:
Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been extensively investigated for Global Navigation Satellite System (GNSS)-denied environments. However, existing retrieval-based approaches face limitations in dataset availability and persistent challenges including suboptimal real-time performance, environmental sensitivity, and limited generalization capability, particularly in dynamic or temporally varying environments. To overcome these limitations, we present a large-scale Multi-Altitude Flight Segments dataset (MAFS) for variable altitude scenarios and propose a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method. This approach integrates robust semantic features from both UAV-captured images and satellite imagery through two key innovations: a semantic weighting mechanism and an optimized particle filtering architecture. Evaluated using our dataset, the proposed method achieves 10x computational efficiency gain over feature extraction methods, maintains global positioning errors below 10 meters, and enables rapid 4 degree of freedom (4-DoF) pose estimation within seconds using accessible low-resolution satellite maps. Code and dataset will be available at https://github.com/YuanJiayuuu/SWA-PF.
中文总结:本研究提出大规模多高度飞行段数据集和新型语义加权自适应粒子滤波方法,在无GNSS环境下实现10倍计算效率提升、10米内定位精度及快速4自由度姿态估计。
English Summary: This study introduces a large-scale Multi-Altitude Flight Segments dataset and a novel Semantic-Weighted Adaptive Particle Filter method that achieves 10x computational efficiency, sub-10-meter positioning accuracy, and rapid 4-DoF pose estimation in GNSS-denied environments.

Authors:Huichun Liu, Xiaosong Li, Yang Liu, Xiaoqi Cheng, Haishu Tan
Title: NDLPNet: A Location-Aware Nighttime Deraining Network and a Real-World Benchmark Dataset
Abstract:
Visual degradation caused by rain streak artifacts in low-light conditions significantly hampers the performance of nighttime surveillance and autonomous navigation. Existing image deraining techniques are primarily designed for daytime conditions and perform poorly under nighttime illumination due to the spatial heterogeneity of rain distribution and the impact of light-dependent stripe visibility. In this paper, we propose a novel Nighttime Deraining Location-enhanced Perceptual Network(NDLPNet) that effectively captures the spatial positional information and density distribution of rain streaks in low-light environments. Specifically, we introduce a Position Perception Module (PPM) to capture and leverage spatial contextual information from input data, enhancing the model's capability to identify and recalibrate the importance of different feature channels. The proposed nighttime deraining network can effectively remove the rain streaks as well as preserve the crucial background information. Furthermore, We construct a night scene rainy (NSR) dataset comprising 900 image pairs, all based on real-world nighttime scenes, providing a new benchmark for nighttime deraining task research. Extensive qualitative and quantitative experimental evaluations on both existing datasets and the NSR dataset consistently demonstrate our method outperform the state-of-the-art (SOTA) methods in nighttime deraining tasks. The source code and dataset is available at https://github.com/Feecuin/NDLPNet.
中文: 本文提出的夜间去雨定位增强感知网络(NDLPNet)能有效消除弱光环境下的雨纹并保留背景信息,通过在真实夜间场景数据集上的验证表明其性能优于现有最优方法。
English: The proposed Nighttime Deraining Location-enhanced Perceptual Network (NDLPNet) effectively removes rain streaks while preserving background details in low-light conditions, outperforming existing methods as validated on a new real-world nighttime dataset.

Authors:Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Jun Du, Quan Liu, Jianqing Gao
Title: THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
Abstract:
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
中文: 提出的THOR框架通过多智能体数据生成流程、分层强化学习优化和推理中的自我修正机制,解决了大语言模型在数学推理中的不足,在数学和代码基准测试中均实现了最优性能。
English: The proposed THOR framework addresses LLMs' limitations in mathematical reasoning by integrating tools through a multi-agent data generation pipeline, hierarchical reinforcement learning optimization, and self-correction during inference, achieving state-of-the-art performance across mathematical and code benchmarks.

Authors:Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jun Du, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Quan Liu, Jianqing Gao
Title: THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning
Abstract:
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both episode-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
中文: 提出的THOR框架通过多智能体数据生成流程、分层强化学习优化和推理中的自我修正机制,解决了大语言模型在数学推理中的不足,在数学和代码基准测试中均实现了最优性能。
English: The proposed THOR framework addresses LLMs' limitations in mathematical reasoning by integrating tools through a multi-agent data generation pipeline, hierarchical reinforcement learning optimization, and self-correction during inference, achieving state-of-the-art performance across mathematical and code benchmarks.

Authors:Jinwoo Jeon, JunHyeok Oh, Hayeong Lee, Byung-Jun Lee
Title: Iterative Prompt Refinement for Safer Text-to-Image Generation
Abstract:
Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at https://github.com/ku-dmlab/IPR. \textbf{\textcolor{red}WARNING: This paper contains examples of harmful or inappropriate images generated by models.
中文: 本文提出一种迭代式提示优化算法,通过视觉语言模型综合分析文本提示与生成图像,在提升安全性的同时保持用户意图,并提供了用于监督微调的新型数据集。
English: This paper introduces an iterative prompt refinement algorithm that uses Vision Language Models to analyze both text prompts and generated images, enhancing safety without compromising user intent while offering a new dataset for supervised fine-tuning.

Authors:Hao Yin, Xin Man, Feiyu Chen, Jie Shao, Heng Tao Shen
Title: Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval
Abstract:
Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning -- hence the term ``full-mode" -- without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at https://github.com/yinhao1102/FMFA.
中文:FMFA框架通过结合显式细粒度对齐和隐式关系推理,无需额外监督即可实现更优的跨模态匹配,从而提升文本到图像的人物检索效果。
English: The proposed FMFA framework enhances text-to-image person retrieval by combining explicit fine-grained alignment and implicit relational reasoning to achieve superior cross-modal matching without extra supervision.

Authors:Hyotaek Jeon, Hyunwook Lee, Juwon Kim, Sungahn Ko
Title: ST-LINK: Spatially-Aware Large Language Models for Spatio-Temporal Forecasting
Abstract:
Traffic forecasting represents a crucial problem within intelligent transportation systems. In recent research, Large Language Models (LLMs) have emerged as a promising method, but their intrinsic design, tailored primarily for sequential token processing, introduces notable challenges in effectively capturing spatial dependencies. Specifically, the inherent limitations of LLMs in modeling spatial relationships and their architectural incompatibility with graph-structured spatial data remain largely unaddressed. To overcome these limitations, we introduce ST-LINK, a novel framework that enhances the capability of Large Language Models to capture spatio-temporal dependencies. Its key components are Spatially-Enhanced Attention (SE-Attention) and the Memory Retrieval Feed-Forward Network (MRFFN). SE-Attention extends rotary position embeddings to integrate spatial correlations as direct rotational transformations within the attention mechanism. This approach maximizes spatial learning while preserving the LLM's inherent sequential processing structure. Meanwhile, MRFFN dynamically retrieves and utilizes key historical patterns to capture complex temporal dependencies and improve the stability of long-term forecasting. Comprehensive experiments on benchmark datasets demonstrate that ST-LINK surpasses conventional deep learning and LLM approaches, and effectively captures both regular traffic patterns and abrupt changes.
中文摘要:ST-LINK是一种新颖的框架,通过空间增强注意力和记忆检索前馈网络增强大语言模型在交通预测中捕捉时空依赖关系的能力,实验证明其性能优于传统方法。
English Summary: ST-LINK is a novel framework that enhances Large Language Models' ability to capture spatio-temporal dependencies in traffic forecasting through Spatially-Enhanced Attention and Memory Retrieval Feed-Forward Network, demonstrating superior performance over conventional methods.

Authors:Ming Dai, Wenxuan Cheng, Jiang-Jiang Liu, Lingfeng Yang, Zhenhua Feng, Wankou Yang, Jingdong Wang
Title: Improving Generalized Visual Grounding with Instance-aware Joint Learning
Abstract:
Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims for achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistency predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments obtained on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at https://github.com/Dmmm1997/InstanceVG.
中文:InstanceVG是一种创新的多任务框架,通过引入实例感知能力统一处理广义指代表达式理解与分割任务,借助实例查询实现框与掩码的一致性预测,在多项评估中达到最优性能。
English: InstanceVG is a novel multi-task framework that jointly addresses Generalized Referring Expression Comprehension and Segmentation by incorporating instance-aware capabilities, achieving state-of-the-art performance through unified predictions of boxes and masks.

Authors:Zirun Guo, Feng Zhang, Kai Jia, Tao Jin
Title: LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Abstract:
We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the "one-tool" bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: https://github.com/ByteDance-BandAI/LLM-I.
中文: LLM-Interleaved (LLM-I) 是一个将交错式图文生成重构为工具使用问题的灵活框架,通过强化学习训练智能体协调多种视觉工具,在多个基准测试中大幅超越现有方法。
English: LLM-Interleaved (LLM-I) is a dynamic framework that transforms interleaved image-text generation into a tool-use problem, enabling an LLM agent to intelligently orchestrate specialized visual tools and achieve state-of-the-art performance across multiple benchmarks.

Authors:Jeremy Oon, Rakhi Manohar Mepparambath, Ling Feng
Title: DeepLogit: A sequentially constrained explainable deep learning modeling approach for transport policy analysis
Abstract:
Despite the significant progress of deep learning models in multitude of applications, their adaption in planning and policy related areas remains challenging due to the black-box nature of these models. In this work, we develop a set of DeepLogit models that follow a novel sequentially constrained approach in estimating deep learning models for transport policy analysis. In the first step of the proposed approach, we estimate a convolutional neural network (CNN) model with only linear terms, which is equivalent of a linear-in-parameter multinomial logit model. We then estimate other deep learning models by constraining the parameters that need interpretability at the values obtained in the linear-in-parameter CNN model and including higher order terms or by introducing advanced deep learning architectures like Transformers. Our approach can retain the interpretability of the selected parameters, yet provides significantly improved model accuracy than the discrete choice model. We demonstrate our approach on a transit route choice example using real-world transit smart card data from Singapore. This study shows the potential for a unifying approach, where theory-based discrete choice model (DCM) and data-driven AI models can leverage each other's strengths in interpretability and predictive power. With the availability of larger datasets and more complex constructions, such approach can lead to more accurate models using discrete choice models while maintaining its applicability in planning and policy-related areas. Our code is available on https://github.com/jeremyoon/route-choice/ .
中文: DeepLogit模型通过融合离散选择模型的可解释线性参数与先进深度学习架构,在保持可解释性的同时显著提升了交通政策分析的预测准确性。
English: The DeepLogit model integrates interpretable linear parameters from discrete choice models with advanced deep learning architectures, enhancing predictive accuracy while maintaining interpretability for transport policy applications.

Authors:Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
Title: See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Abstract:
The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30\%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
中文摘要:本文提出状态感知推理(StaR)训练方法,通过教导多模态智能体感知当前切换状态并解析指令中的目标状态,将切换指令执行准确率提升超过30%,同时在多个基准测试中有效提升通用任务性能。
English Summary: This paper introduces State-aware Reasoning (StaR), a training method that significantly improves multimodal agents' accuracy in executing toggle instructions by over 30% through teaching them to perceive current states and analyze desired actions, while also enhancing general task performance across benchmarks.

Authors:Jiangbei Yue, Shuonan Yang, Tailin Chen, Jianbo Jiao, Zeyu Fu
Title: Multimodal Hate Detection Using Dual-Stream Graph Neural Networks
Abstract:
Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.
中文: 该研究提出的多模态双流图神经网络模型通过构建实例图和权重图来突出仇恨内容并捕捉结构化关系,有效检测仇恨视频,实现了最先进的分类性能和强可解释性。
English: The proposed multimodal dual-stream graph neural network model effectively detects hateful videos by constructing instance and weight graphs to emphasize hateful content and capture structured relationships, achieving state-of-the-art performance and explainability.

Authors:Uriel Garcilazo-Cruz, Joseph O. Okeme, Rodrigo A. Vargas--Hernández
Title: LivePyxel: Accelerating image annotations with a Python-integrated webcam live streaming
Abstract:
The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where real-time data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce \texttt{LivePixel}, a Python-based graphical user interface that integrates with imaging systems, such as webcams, microscopes, and others, to enable real-time image annotation. LivePyxel is designed to be easy to use through a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics editing software. Of particular interest is the availability of Bézier splines and binary masks, and the software's capacity to work with non-destructive layers that enable high-performance editing. LivePyxel also integrates a wide compatibility across video devices, and it's optimized for object detection operations via the use of OpenCV in combination with high-performance libraries designed to handle matrix and linear algebra operations via Numpy effectively. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. LivePyxel freely available at https://github.com/UGarCil/LivePyxel
中文: LivePixel 是一款基于 Python 的工具,可直接从显微镜等设备进行实时图像标注,通过整合贝塞尔曲线和二进制掩码等灵活工具,解决了传统软件的局限性,从而加速科学工作流程中人工智能模型的开发。
English: LivePixel is a Python-based tool that enables real-time image annotation directly from devices like microscopes, addressing the limitations of traditional software by integrating flexible tools such as Bézier splines and binary masks to streamline AI model development in scientific workflows.

Authors:Hao Xu, Xiaolin Wu, Xi Zhang
Title: Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization
Abstract:
3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
中文: 3D高斯泼溅(3DGS)虽能实现逼真渲染但数据量庞大需压缩,采用场景自适应格型矢量量化(SALVQ)替代均匀标量量化,能以极低开销提升压缩性能并实现动态码率调节。
English: 3D Gaussian Splatting (3DGS) achieves photorealistic rendering but requires compression for cost efficiency, and replacing uniform scalar quantization with scene-adaptive lattice vector quantization (SALVQ) enhances compression performance with minimal overhead while enabling dynamic bit rate adjustment.

Authors:Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
Title: SteeringControl: Holistic Evaluation of Alignment Steering in LLMs
Abstract:
We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives--bias, harmful generation, and hallucination--and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find there are many unexplored tradeoffs not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement centered around five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B find that strong steering performance is dependent on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can result from poor combinations of these three as well. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.
中文: 本文提出SteeringControl基准,用于评估表征引导方法在偏见和幻觉等对齐目标上的效果,发现引导效果取决于方法、模型和行为的相互作用,并公开了相关代码。
English: This paper introduces SteeringControl, a benchmark for evaluating representation steering methods across alignment objectives like bias and hallucination, revealing that steering effectiveness depends on the interplay between methods, models, and behaviors, with code made publicly available.

Authors:Zixi Li
Title: Asterisk Operator
Abstract:
We propose the \textbf{Asterisk Operator} ($\ast$-operator), a novel unified framework for abstract reasoning based on Adjacency-Structured Parallel Propagation (ASPP). The operator formalizes structured reasoning tasks as local, parallel state evolution processes guided by implicit relational graphs. We prove that the $\ast$-operator maintains local computational constraints while achieving global reasoning capabilities, providing an efficient and convergent computational paradigm for abstract reasoning problems. Through rigorous mathematical analysis and comprehensive experiments on ARC2 challenges and Conway's Game of Life, we demonstrate the operator's universality, convergence properties, and superior performance. Our innovative Embedding-Asterisk distillation method achieves 100\% accuracy on ARC2 validation with only 6M parameters, representing a significant breakthrough in neural-symbolic reasoning. \textbf{Keywords:} Abstract Reasoning, Adjacency Structure, Parallel Propagation, Asterisk Operator, Convergence, Universal Approximation
中文摘要:Asterisk算子是一种基于邻接结构并行传播的新型抽象推理框架,通过创新的嵌入-星号蒸馏方法,仅用600万参数即在ARC2验证集上实现100%准确率,标志着神经符号推理领域的重大突破。
English Summary: The Asterisk Operator is a novel unified framework for abstract reasoning that formalizes structured tasks as parallel state evolution processes, achieving 100% accuracy on ARC2 validation with only 6M parameters through its innovative distillation method.

Authors:Zihao Wang, Muyao Li, Kaichen He, Xiangyu Wang, Zhancun Mu, Anji Liu, Yitao Liang
Title: OpenHA: A Series of Open-Source Hierarchical Agentic Models in Minecraft
Abstract:
The choice of action spaces is a critical yet unresolved challenge in developing capable, end-to-end trainable agents. This paper first presents a large-scale, systematic comparison of prominent abstracted action spaces and tokenizers for Vision-Language-Action (VLA) or hierarchical agent models in the open-ended Minecraft. Our analysis reveals that no single action space is universally optimal; instead, the most effective abstraction is highly task-dependent, creating a dilemma for building generalist agents. To resolve this, we introduce Chain of Action (CoA), a novel framework that unifies high-level planning and low-level control within a single, monolithic VLA model. CoA treats an abstracted action not as a command for a separate policy, but as an intermediate reasoning step--akin to a chain of thought--that guides the generation of the final, executable action. Furthermore, we demonstrate that an All-in-One agent trained on a diverse mixture of action spaces using the CoA paradigm learns a more robust and generalizable policy. This unified agent achieves a new state-of-the-art, improving the overall task success rate over strong, specialized baselines. To foster reproducible research, we release the OpenHA (Open Hierarchical Agents) suite, which includes our comprehensive benchmark of over 800 distinct tasks, curated datasets, source code, and all pretrained model checkpoints at https://github.com/CraftJarvis/OpenHA
中文摘要:本文提出Chain of Action(CoA)新框架,将高层规划与低层控制统一于单一视觉-语言-动作模型中,证明在多样化动作空间上训练的智能体可获得更强泛化能力,并在《我的世界》中实现了最先进的性能。
English Summary: This paper introduces Chain of Action (CoA), a novel framework that integrates high-level planning with low-level control in a single Vision-Language-Action model, demonstrating that training agents on diverse action spaces yields more robust policies and achieves state-of-the-art performance in Minecraft.

Authors:Anand Swaroop, Akshat Nallani, Saksham Uboweja, Adiliia Uzdenova, Michael Nguyen, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma, Maheep Chaudhary
Title: FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness
Abstract:
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks, but recent work shows that reasoning steps often fail to causally influence the final answer, creating brittle and untrustworthy outputs. Prior approaches focus primarily on measuring faithfulness, while methods for systematically improving it remain limited. We introduce Faithful Reasoning via Intervention Training (FRIT), a scalable alignment method that trains models to produce causally consistent reasoning by learning from systematically corrupted examples. FRIT generates synthetic training data by intervening on individual reasoning steps in model-generated CoTs, creating faithful/unfaithful pairs that highlight when reasoning breaks down. We then apply Direct Preference Optimization to teach models to prefer causally consistent reasoning paths. Evaluating on Qwen3-8B and Mistral-7B-v0.1 across factual and symbolic reasoning tasks, FRIT increases faithful reasoning by $3.4$ percentage points for Mistral on GSM8K while improving accuracy by $7.6$ percentage points. Our approach provides the first scalable, supervision-free method for training language models to produce more reliable and interpretable reasoning, addressing a critical gap between reasoning performance and trustworthiness. We release our code at \href{https://github.com/Anut-py/frit}.
中文: FRIT是一种通过干预推理步骤生成合成训练数据,并利用直接偏好优化教导模型选择因果一致推理路径的可扩展对齐方法,有效提升了语言模型在事实和符号推理任务中的忠实推理能力和准确性。
English: FRIT is a scalable alignment method that improves the causal consistency and trustworthiness of chain-of-thought reasoning in language models by training them with synthetic data generated through intervention on reasoning steps, resulting in enhanced accuracy and faithful reasoning across various tasks.

Authors:Zeyu Ma, Adam Finkelstein, Jia Deng
Title: Temporally Smooth Mesh Extraction for Procedural Scenes with Long-Range Camera Trajectories using Spacetime Octrees
Abstract:
The procedural occupancy function is a flexible and compact representation for creating 3D scenes. For rasterization and other tasks, it is often necessary to extract a mesh that represents the shape. Unbounded scenes with long-range camera trajectories, such as flying through a forest, pose a unique challenge for mesh extraction. A single static mesh representing all the geometric detail necessary for the full camera path can be prohibitively large. Therefore, independent meshes can be extracted for different camera views, but this approach may lead to popping artifacts during transitions. We propose a temporally coherent method for extracting meshes suitable for long-range camera trajectories in unbounded scenes represented by an occupancy function. The key idea is to perform 4D mesh extraction using a new spacetime tree structure called a binary-octree. Experiments show that, compared to existing baseline methods, our method offers superior visual consistency at a comparable cost. The code and the supplementary video for this paper are available at https://github.com/princeton-vl/BinocMesher.
中文: 该方法通过四维时空二叉八叉树提取无界场景的时间相干网格,在相近成本下相比基线方法实现了更优的视觉一致性。
English: The proposed method extracts temporally coherent meshes for unbounded scenes using a 4D spacetime binary-octree, achieving superior visual consistency at comparable cost compared to baseline methods.

Authors:Zeyu Ma, Adam Finkelstein, Jia Deng
Title: Temporally Smooth Mesh Extraction for Procedural Scenes with Long-Range Camera Trajectories using Spacetime Octrees
Abstract:
The procedural occupancy function is a flexible and compact representation for creating 3D scenes. For rasterization and other tasks, it is often necessary to extract a mesh that represents the shape. Unbounded scenes with long-range camera trajectories, such as flying through a forest, pose a unique challenge for mesh extraction. A single static mesh representing all the geometric detail necessary for the full camera path can be prohibitively large. Therefore, independent meshes can be extracted for different camera views, but this approach may lead to popping artifacts during transitions. We propose a temporally coherent method for extracting meshes suitable for long-range camera trajectories in unbounded scenes represented by an occupancy function. The key idea is to perform 4D mesh extraction using a new spacetime tree structure called a binary-octree. Experiments show that, compared to existing baseline methods, our method offers superior visual consistency at a comparable cost. The code and the supplementary video for this paper are available at https://github.com/princeton-vl/BinocMesher.
中文: 该方法通过四维时空二叉八叉树提取无界场景的时间相干网格,在相近成本下相比基线方法实现了更优的视觉一致性。
English: The proposed method extracts temporally coherent meshes for unbounded scenes using a 4D spacetime binary-octree, achieving superior visual consistency at comparable cost compared to baseline methods.

Authors:Rodrigo M Carrillo-Larco
Title: LLMs for energy and macronutrients estimation using only text data from 24-hour dietary recalls: a parameter-efficient fine-tuning experiment using a 10-shot prompt
Abstract:
BACKGROUND: Most artificial intelligence tools used to estimate nutritional content rely on image input. However, whether large language models (LLMs) can accurately predict nutritional values based solely on text descriptions of foods consumed remains unknown. If effective, this approach could enable simpler dietary monitoring without the need for photographs. METHODS: We used 24-hour dietary recalls from adolescents aged 12-19 years in the National Health and Nutrition Examination Survey (NHANES). An open-source quantized LLM was prompted using a 10-shot, chain-of-thought approach to estimate energy and five macronutrients based solely on text strings listing foods and their quantities. We then applied parameter-efficient fine-tuning (PEFT) to evaluate whether predictive accuracy improved. NHANES-calculated values served as the ground truth for energy, proteins, carbohydrates, total sugar, dietary fiber and total fat. RESULTS: In a pooled dataset of 11,281 adolescents (49.9% male, mean age 15.4 years), the vanilla LLM yielded poor predictions. The mean absolute error (MAE) was 652.08 for energy and the Lin's CCC <0.46 across endpoints. In contrast, the fine-tuned model performed substantially better, with energy MAEs ranging from 171.34 to 190.90 across subsets, and Lin's CCC exceeding 0.89 for all outcomes. CONCLUSIONS: When prompted using a chain-of-thought approach and fine-tuned with PEFT, open-source LLMs exposed solely to text input can accurately predict energy and macronutrient values from 24-hour dietary recalls. This approach holds promise for low-burden, text-based dietary monitoring tools.
中文: 经过微调的大语言模型仅通过文本饮食描述即可精确预测能量和宏量营养素,为低负担的饮食监测提供了有望的文本解决方案。
English: Fine-tuned large language models using text-only dietary descriptions can accurately predict energy and macronutrients, offering a low-burden alternative to image-based nutritional assessment tools.

Authors:Zhizhong Zhao, Ke Chen
Title: Post-Hoc Split-Point Self-Consistency Verification for Efficient, Unified Quantification of Aleatoric and Epistemic Uncertainty in Deep Learning
Abstract:
Uncertainty quantification (UQ) is vital for trustworthy deep learning, yet existing methods are either computationally intensive, such as Bayesian or ensemble methods, or provide only partial, task-specific estimates, such as single-forward-pass techniques. In this paper, we propose a post-hoc single-forward-pass framework that jointly captures aleatoric and epistemic uncertainty without modifying or retraining pretrained models. Our method applies \emph{Split-Point Analysis} (SPA) to decompose predictive residuals into upper and lower subsets, computing \emph{Mean Absolute Residuals} (MARs) on each side. We prove that, under ideal conditions, the total MAR equals the harmonic mean of subset MARs; deviations define a novel \emph{Self-consistency Discrepancy Score} (SDS) for fine-grained epistemic estimation across regression and classification. For regression, side-specific quantile regression yields prediction intervals with improved empirical coverage, which are further calibrated via SDS. For classification, when calibration data are available, we apply SPA-based calibration identities to adjust the softmax outputs and then compute predictive entropy on these calibrated probabilities. Extensive experiments on diverse regression and classification benchmarks demonstrate that our framework matches or exceeds several state-of-the-art UQ methods while incurring minimal overhead. Our source code is available at https://github.com/zzz0527/SPC-UQ.
中文: 本文提出了一种无需重新训练模型的后处理单次前向传播框架,通过分割点分析和自洽性差异评分同时捕捉任意性和认知不确定性,在多种基准测试中以最小计算开销达到或超越了现有最优方法。
English: This paper introduces a post-hoc single-forward-pass framework that captures both aleatoric and epistemic uncertainty without retraining models, using Split-Point Analysis and a Self-consistency Discrepancy Score to achieve state-of-the-art performance with minimal computational overhead.

Authors:Hugo Carlesso, Josiane Mothe, Radu Tudor Ionescu
Title: Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation
Abstract:
Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g. cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at https://github.com/hugocarlesso/CMTSSL.
中文: 高光谱成像需要紧凑模型以支持星载高效处理,新提出的课程多任务自监督学习框架通过整合空间与光谱推理的轻量化设计,在模型比现有技术轻16,000倍的情况下仍保持优异性能。
English: Hyperspectral imaging requires compact models for efficient onboard satellite processing, which is addressed by the novel curriculum multi-task self-supervised learning framework that integrates spatial and spectral reasoning in a lightweight design, achieving strong performance with models over 16,000 times lighter than existing ones.

Authors:Jiahao Xu, Zikai Zhang, Rui Hu
Title: On the Out-of-Distribution Backdoor Attack for Federated Learning
Abstract:
Traditional backdoor attacks in federated learning (FL) operate within constrained attack scenarios, as they depend on visible triggers and require physical modifications to the target object, which limits their practicality. To address this limitation, we introduce a novel backdoor attack prototype for FL called the out-of-distribution (OOD) backdoor attack ($\mathtt{OBA}$), which uses OOD data as both poisoned samples and triggers simultaneously. Our approach significantly broadens the scope of backdoor attack scenarios in FL. To improve the stealthiness of $\mathtt{OBA}$, we propose $\mathtt{SoDa}$, which regularizes both the magnitude and direction of malicious local models during local training, aligning them closely with their benign versions to evade detection. Empirical results demonstrate that $\mathtt{OBA}$ effectively circumvents state-of-the-art defenses while maintaining high accuracy on the main task. To address this security vulnerability in the FL system, we introduce $\mathtt{BNGuard}$, a new server-side defense method tailored against $\mathtt{SoDa}$. $\mathtt{BNGuard}$ leverages the observation that OOD data causes significant deviations in the running statistics of batch normalization layers. This allows $\mathtt{BNGuard}$ to identify malicious model updates and exclude them from aggregation, thereby enhancing the backdoor robustness of FL. Extensive experiments across various settings show the effectiveness of $\mathtt{BNGuard}$ on defending against $\mathtt{SoDa}$. The code is available at https://github.com/JiiahaoXU/SoDa-BNGuard.
中文: 本文提出了一种新颖的联邦学习分布外后门攻击(OBA),利用OOD数据作为触发器,并开发了隐蔽增强方法SoDa,同时设计了BNGuard防御机制,通过批归一化层统计检测恶意更新以增强系统安全性。
English: This paper introduces a novel out-of-distribution backdoor attack (OBA) for federated learning that uses OOD data as triggers, along with a stealth-enhancing method SoDa, and proposes BNGuard defense that detects malicious updates through batch normalization statistics to secure FL systems.

Authors:Salvatore Esposito, Matías Mattamala, Daniel Rebain, Francis Xiatian Zhang, Kevin Dhaliwal, Mohsen Khadem, Subramanian Ramamoorthy
Title: ROOM: A Physics-Based Continuum Robot Simulator for Photorealistic Medical Datasets Generation
Abstract:
Continuum robots are advancing bronchoscopy procedures by accessing complex lung airways and enabling targeted interventions. However, their development is limited by the lack of realistic training and test environments: Real data is difficult to collect due to ethical constraints and patient safety concerns, and developing autonomy algorithms requires realistic imaging and physical feedback. We present ROOM (Realistic Optical Observation in Medicine), a comprehensive simulation framework designed for generating photorealistic bronchoscopy training data. By leveraging patient CT scans, our pipeline renders multi-modal sensor data including RGB images with realistic noise and light specularities, metric depth maps, surface normals, optical flow and point clouds at medically relevant scales. We validate the data generated by ROOM in two canonical tasks for medical robotics -- multi-view pose estimation and monocular depth estimation, demonstrating diverse challenges that state-of-the-art methods must overcome to transfer to these medical settings. Furthermore, we show that the data produced by ROOM can be used to fine-tune existing depth estimation models to overcome these challenges, also enabling other downstream applications such as navigation. We expect that ROOM will enable large-scale data generation across diverse patient anatomies and procedural scenarios that are challenging to capture in clinical settings. Code and data: https://github.com/iamsalvatore/room.
中文: ROOM仿真框架通过患者CT扫描生成逼真的支气管镜训练数据,解决了真实数据收集的伦理限制,为连续体机器人医疗程序的自主算法开发提供了关键支持。
English: The ROOM simulation framework generates photorealistic bronchoscopy training data from patient CT scans to overcome limitations in real data collection, enabling the development of autonomy algorithms for continuum robots in medical procedures.

Authors:Yingtai Li, Haoran Lai, Xiaoqian Zhou, Shuai Ming, Wenxin Ma, Wei Wei, Shaohua Kevin Zhou
Title: More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
Abstract:
The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrate that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96\% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale "silver-standard" datasets at a minimal cost (~\$3 for 50k CT image-report pairs). Further, we find that vision encoder trained on this "silver-standard" dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing the access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8\% AUC for zero-shot diagnosis on CT-RATE, 77.3\% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7\% for image-image, Recall@100=52.2\% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate {\bf more performant and scalable} medical AI systems. Our code is avaiable at https://github.com/SadVoxel/More-performant-and-scalable.
中文: 大型语言模型能够自动从放射报告中提取诊断标签,以极低成本创建大规模医疗数据集,在医学人工智能系统的视觉语言预训练中实现了最先进的性能。
English: Large Language Models enable cost-effective creation of large-scale medical datasets by automatically extracting diagnostic labels from radiology reports, achieving state-of-the-art performance in vision-language pre-training for medical AI systems.

Authors:Ruifei Ding, Zhe Chen, Wen Fan, Chen Long, Huijuan Xiao, Yelu Zeng, Zhen Dong, Bisheng Yang
Title: WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory
Abstract:
Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scene, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging the unique characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks--tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline potential future works for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Model for street tree asset management. The WHU-STree dataset is accessible at: https://github.com/WHU-USI3DV/WHU-STree.
中文: WHU-STree数据集通过提供跨城市、多模态且标注丰富的街道树木数据,解决了现有数据集的局限性,支持10余项任务,并在树种分类和单木分割中展现了数据融合的实际应用潜力。
English: The WHU-STree dataset addresses limitations in urban street tree inventories by providing a cross-city, multi-modal collection with rich annotations, supporting over 10 tasks and demonstrating the potential of data fusion for practical algorithm deployment in tree species classification and segmentation.

Authors:Zhihao Zhang, Chunyu Lin, Lang Nie, Jiyuan Wang, Yao Zhao
Title: Advancing Real-World Parking Slot Detection with Large-Scale Dataset and Semi-Supervised Baseline
Abstract:
As automatic parking systems evolve, the accurate detection of parking slots has become increasingly critical. This study focuses on parking slot detection using surround-view cameras, which offer a comprehensive bird's-eye view of the parking environment. However, the current datasets are limited in scale, and the scenes they contain are seldom disrupted by real-world noise (e.g., light, occlusion, etc.). Moreover, manual data annotation is prone to errors and omissions due to the complexity of real-world conditions, significantly increasing the cost of annotating large-scale datasets. To address these issues, we first construct a large-scale parking slot detection dataset (named CRPS-D), which includes various lighting distributions, diverse weather conditions, and challenging parking slot variants. Compared with existing datasets, the proposed dataset boasts the largest data scale and consists of a higher density of parking slots, particularly featuring more slanted parking slots. Additionally, we develop a semi-supervised baseline for parking slot detection, termed SS-PSD, to further improve performance by exploiting unlabeled data. To our knowledge, this is the first semi-supervised approach in parking slot detection, which is built on the teacher-student model with confidence-guided mask consistency and adaptive feature perturbation. Experimental results demonstrate the superiority of SS-PSD over the existing state-of-the-art (SoTA) solutions on both the proposed dataset and the existing dataset. Particularly, the more unlabeled data there is, the more significant the gains brought by our semi-supervised scheme. The relevant source codes and the dataset have been made publicly available at https://github.com/zzh362/CRPS-D.
中文: 本研究提出了大规模停车位检测数据集CRPS-D以解决现有数据集在真实场景覆盖上的不足,并开发了半监督方法SS-PSD,通过利用未标注数据显著提升了检测性能,超越了现有最优方案。
English: This study introduces CRPS-D, a large-scale parking slot detection dataset addressing limitations of existing datasets by including diverse real-world conditions, and proposes SS-PSD, a semi-supervised method that outperforms state-of-the-art solutions by leveraging unlabeled data.

Authors:Sijia Cui, Shuai Xu, Aiyao He, Yanna Wang, Bo Xu
Title: Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning
Abstract:
Recent advancements in Large Language Models(LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.
Chinese Summary: PLAP框架通过整合技能库、大语言模型规划器和技能执行器,显著提升基于大语言模型的智能体在复杂长周期环境中的表现,在实时策略游戏等任务中实现了超越基准的优异性能。
English Summary: The PLAP framework enhances LLM-based agents' performance in complex environments by integrating a skill library, LLM-powered planner, and skill executor, achieving superior results in long-horizon tasks like real-time strategy games.

Authors:Zijie Zhao, Honglei Guo, Shengqian Chen, Kaixuan Xu, Bo Jiang, Yuanheng Zhu, Dongbin Zhao
Title: Empowering Multi-Robot Cooperation via Sequential World Models
Abstract:
Model-based reinforcement learning (MBRL) has shown significant potential in robotics due to its high sample efficiency and planning capability. However, extending MBRL to multi-robot cooperation remains challenging due to the complexity of joint dynamics. To address this, we propose the Sequential World Model (SeqWM), a novel framework that integrates the sequential paradigm into model-based multi-agent reinforcement learning. SeqWM employs independent, sequentially structured agent-wise world models to decompose complex joint dynamics. Latent rollouts and decision-making are performed through sequential communication, where each agent generates its future trajectory and plans its actions based on the predictions of its predecessors. This design enables explicit intention sharing, enhancing cooperative performance, and reduces communication overhead to linear complexity. Results in challenging simulated environments (Bi-DexHands and Multi-Quad) show that SeqWM outperforms existing state-of-the-art model-free and model-based baselines in both overall performance and sample efficiency, while exhibiting advanced cooperative behaviors such as predictive adaptation and role division. Furthermore, SeqWM has been success fully deployed on physical quadruped robots, demonstrating its effectiveness in real-world multi-robot systems. Demos and code are available at: https://github.com/zhaozijie2022/seqwm-marl
中文摘要:提出的序列世界模型(SeqWM)通过顺序智能体世界模型和通信机制分解复杂联合动力学,在仿真和实际机器人部署中均实现了卓越的合作性能与样本效率。
English Summary: The proposed Sequential World Model (SeqWM) enhances multi-robot cooperation by decomposing complex joint dynamics through sequential agent-wise world models and communication, achieving superior performance and sample efficiency in both simulations and real-world deployments.

Authors:Yujie Guo, Jiaming Zhou, Yuhang Jia, Shiwan Zhao, Yong Qin
Title: GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR
Abstract:
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we proposed Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which dynamically fuse speaker-aware global information and fine-grained local features to guide expert selection. This mechanism enables speaker-specific routing by leveraging both global context and local acoustic cues. Experiments on LibriSpeechMix show that GLAD outperforms existing MTASR approaches, particularly in challenging multi-talker scenarios. To our best knowledge, this is the first work to apply Mixture-of-Experts (MoE) to end-to-end MTASR with a global-local fusion strategy. Our code and train dataset can be found at https://github.com/NKU-HLT/GLAD.
Chinese: 提出的GLAD混合专家模型通过动态融合说话人全局信息和局部声学特征来改进重叠语音识别,在LibriSpeechMix数据集上展现出优于现有方法的性能表现。
English: The proposed GLAD Mixture-of-Experts dynamically integrates global speaker context with local acoustic features to enhance overlapping speech recognition, demonstrating superior performance on LibriSpeechMix compared to existing methods.

Authors:Yan Xingyang, Huang Xiaohong, Zhang Zhao, You Tian, Xu Ziheng
Title: Using KL-Divergence to Focus Frequency Information in Low-Light Image Enhancement
Abstract:
In the Fourier domain, luminance information is primarily encoded in the amplitude spectrum, while spatial structures are captured in the phase components. The traditional Fourier Frequency information fitting employs pixel-wise loss functions, which tend to focus excessively on local information and may lead to global information loss. In this paper, we present LLFDisc, a U-shaped deep enhancement network that integrates cross-attention and gating mechanisms tailored for frequency-aware enhancement. We propose a novel distribution-aware loss that directly fits the Fourier-domain information and minimizes their divergence using a closed-form KL-Divergence objective. This enables the model to align Fourier-domain information more robustly than with conventional MSE-based losses. Furthermore, we enhance the perceptual loss based on VGG by embedding KL-Divergence on extracted deep features, enabling better structural fidelity. Extensive experiments across multiple benchmarks demonstrate that LLFDisc achieves state-of-the-art performance in both qualitative and quantitative evaluations. Our code will be released at: https://github.com/YanXY000/LLFDisc
中文摘要:本文提出LLFDisc,一种采用新型分布感知损失和增强感知损失的U形深度增强网络,通过鲁棒对齐傅里叶域信息并提升结构保真度,实现了最先进的性能表现。
English Summary: This paper introduces LLFDisc, a U-shaped deep enhancement network with a novel distribution-aware loss and enhanced perceptual loss, which achieves state-of-the-art performance by robustly aligning Fourier-domain information and improving structural fidelity.

Authors:Yan Xingyang, Huang Xiaohong, Zhang Zhao, You Tian, Xu Ziheng
Title: Using KL-Divergence to Focus Frequency Information in Low-Light Image Enhancement
Abstract:
In the Fourier domain, luminance information is primarily encoded in the amplitude spectrum, while spatial structures are captured in the phase components. The traditional Fourier Frequency information fitting employs pixel-wise loss functions, which tend to focus excessively on local information and may lead to global information loss. In this paper, we present LLFDisc, a U-shaped deep enhancement network that integrates cross-attention and gating mechanisms tailored for frequency-aware enhancement. We propose a novel distribution-aware loss that directly fits the Fourier-domain information and minimizes their divergence using a closed-form KL-Divergence objective. This enables the model to align Fourier-domain information more robustly than with conventional MSE-based losses. Furthermore, we enhance the perceptual loss based on VGG by embedding KL-Divergence on extracted deep features, enabling better structural fidelity. Extensive experiments across multiple benchmarks demonstrate that LLFDisc achieves state-of-the-art performance in both qualitative and quantitative evaluations. Our code will be released at: https://github.com/YanXY000/LLFDisc
中文摘要:本文提出LLFDisc,一种采用新型分布感知损失和增强感知损失的U形深度增强网络,通过鲁棒对齐傅里叶域信息并提升结构保真度,实现了最先进的性能表现。
English Summary: This paper introduces LLFDisc, a U-shaped deep enhancement network with a novel distribution-aware loss and enhanced perceptual loss, which achieves state-of-the-art performance by robustly aligning Fourier-domain information and improving structural fidelity.

Authors:Yukun Chen, Zhaoxi Mu, Andong Li, Peilin Li, Xinyu Yang
Title: Spiking Vocos: An Energy-Efficient Neural Vocoder
Abstract:
Despite the remarkable progress in the synthesis speed and fidelity of neural vocoders, their high energy consumption remains a critical barrier to practical deployment on computationally restricted edge devices. Spiking Neural Networks (SNNs), widely recognized for their high energy efficiency due to their event-driven nature, offer a promising solution for low-resource scenarios. In this paper, we propose Spiking Vocos, a novel spiking neural vocoder with ultra-low energy consumption, built upon the efficient Vocos framework. To mitigate the inherent information bottleneck in SNNs, we design a Spiking ConvNeXt module to reduce Multiply-Accumulate (MAC) operations and incorporate an amplitude shortcut path to preserve crucial signal dynamics. Furthermore, to bridge the performance gap with its Artificial Neural Network (ANN) counterpart, we introduce a self-architectural distillation strategy to effectively transfer knowledge. A lightweight Temporal Shift Module is also integrated to enhance the model's ability to fuse information across the temporal dimension with negligible computational overhead. Experiments demonstrate that our model achieves performance comparable to its ANN counterpart, with UTMOS and PESQ scores of 3.74 and 3.45 respectively, while consuming only 14.7% of the energy. The source code is available at https://github.com/pymaster17/Spiking-Vocos.
Chinese: Spiking Vocos是一种超低能耗的脉冲神经声码器,通过Spiking ConvNeXt模块和自架构蒸馏策略,在仅消耗14.7%能耗的情况下实现了与人工神经网络相当的性能表现。
English: Spiking Vocos is an ultra-low energy spiking neural vocoder that achieves performance comparable to its ANN counterpart while consuming only 14.7% of the energy through innovative modules like Spiking ConvNeXt and self-architectural distillation.

Authors:Eyal German, Daniel Samira, Yuval Elovici, Asaf Shabtai
Title: MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data
Abstract:
Synthetic data generation plays an important role in enabling data sharing, particularly in sensitive domains like healthcare and finance. Recent advances in diffusion models have made it possible to generate realistic, high-quality tabular data, but they may also memorize training records and leak sensitive information. Membership inference attacks (MIAs) exploit this vulnerability by determining whether a record was used in training. While MIAs have been studied in images and text, their use against tabular diffusion models remains underexplored despite the unique risks of structured attributes and limited record diversity. In this paper, we introduce MIAEPT, Membership Inference Attack via Error Prediction for Tabular Data, a novel black-box attack specifically designed to target tabular diffusion models. MIA-EPT constructs errorbased feature vectors by masking and reconstructing attributes of target records, disclosing membership signals based on how well these attributes are predicted. MIA-EPT operates without access to the internal components of the generative model, relying only on its synthetic data output, and was shown to generalize across multiple state-of-the-art diffusion models. We validate MIA-EPT on three diffusion-based synthesizers, achieving AUC-ROC scores of up to 0.599 and TPR@10% FPR values of 22.0% in our internal tests. Under the MIDST 2025 competition conditions, MIA-EPT achieved second place in the Black-box Multi-Table track (TPR@10% FPR = 20.0%). These results demonstrate that our method can uncover substantial membership leakage in synthetic tabular data, challenging the assumption that synthetic data is inherently privacy-preserving. Our code is publicly available at https://github.com/eyalgerman/MIA-EPT.
中文: 本文提出MIA-EPT这一黑盒成员推理攻击方法,通过分析重构误差有效识别表格扩散模型的训练数据泄露,挑战了合成数据天生保护隐私的假设。
English: This paper introduces MIA-EPT, a black-box membership inference attack method that effectively identifies training data leakage in tabular diffusion models by analyzing reconstruction errors, challenging the assumption that synthetic data inherently preserves privacy.

Authors:Eyal German, Daniel Samira, Yuval Elovici, Asaf Shabtai
Title: MIA-EPT: Membership Inference Attack via Error Prediction for Tabular Data
Abstract:
Synthetic data generation plays an important role in enabling data sharing, particularly in sensitive domains like healthcare and finance. Recent advances in diffusion models have made it possible to generate realistic, high-quality tabular data, but they may also memorize training records and leak sensitive information. Membership inference attacks (MIAs) exploit this vulnerability by determining whether a record was used in training. While MIAs have been studied in images and text, their use against tabular diffusion models remains underexplored despite the unique risks of structured attributes and limited record diversity. In this paper, we introduce MIAEPT, Membership Inference Attack via Error Prediction for Tabular Data, a novel black-box attack specifically designed to target tabular diffusion models. MIA-EPT constructs errorbased feature vectors by masking and reconstructing attributes of target records, disclosing membership signals based on how well these attributes are predicted. MIA-EPT operates without access to the internal components of the generative model, relying only on its synthetic data output, and was shown to generalize across multiple state-of-the-art diffusion models. We validate MIA-EPT on three diffusion-based synthesizers, achieving AUC-ROC scores of up to 0.599 and TPR@10% FPR values of 22.0% in our internal tests. Under the MIDST 2025 competition conditions, MIA-EPT achieved second place in the Black-box Multi-Table track (TPR@10% FPR = 20.0%). These results demonstrate that our method can uncover substantial membership leakage in synthetic tabular data, challenging the assumption that synthetic data is inherently privacy-preserving. Our code is publicly available at https://github.com/eyalgerman/MIA-EPT.
中文: 本文提出MIA-EPT这一黑盒成员推理攻击方法,通过分析重构误差有效识别表格扩散模型的训练数据泄露,挑战了合成数据天生保护隐私的假设。
English: This paper introduces MIA-EPT, a black-box membership inference attack method that effectively identifies training data leakage in tabular diffusion models by analyzing reconstruction errors, challenging the assumption that synthetic data inherently preserves privacy.

Authors:Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang
Title: Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection
Abstract:
In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
中文: 本文提出了一种双阶段重加权专家混合框架,通过融合多模型特征和专用分类器,有效检测第一人称视频中细微且罕见的用户错误行为。
English: This paper introduces a Dual-Stage Reweighed Mixture-of-Experts (DR-MoE) framework that effectively detects subtle and infrequent user errors in egocentric videos by combining multi-model features and specialized classifiers.

Authors:Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang
Title: Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection
Abstract:
In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
中文: 本文提出了一种双阶段重加权专家混合框架,通过融合多模型特征和专用分类器,有效检测第一人称视频中细微且罕见的用户错误行为。
English: This paper introduces a Dual-Stage Reweighed Mixture-of-Experts (DR-MoE) framework that effectively detects subtle and infrequent user errors in egocentric videos by combining multi-model features and specialized classifiers.

Authors:Heng Zhang, Chengzhi Zhang
Title: Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework
Abstract:
The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of "AI for Science". However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their document locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.
中文: 本研究提出了一个端到端框架,通过挖掘全文学术论文生成结构化研究流程,采用PU学习和提示学习等技术识别和分类流程组件,最终生成可视化流程图并揭示自然语言处理领域二十年来方法论的演变。
English: This study introduces an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers, employing techniques like PU learning and prompt learning to identify and categorize workflow components, ultimately producing visual flowcharts and revealing methodological shifts in NLP over two decades.

Authors:Hojat Ardi, Amir Jahanshahi, Ali Diba
Title: T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking
Abstract:
Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. Code is available at: https://github.com/to/be/released
Chinese: T-SiamTPN通过在SiamTPN架构中引入时序建模,显著提升了空中目标跟踪的鲁棒性和精度,同时保持实时计算效率,适用于嵌入式应用场景。
English: T-SiamTPN enhances aerial object tracking by integrating temporal modeling into the SiamTPN framework, improving robustness and achieving state-of-the-art performance while maintaining real-time efficiency on embedded devices.

Authors:Weiming Chen, Zhihan Zhu, Yijia Wang, Zhihai He
Title: Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing
Abstract:
Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
Chinese: 整流流模型面临反转精度低和多模态注意力纠缠的挑战,通过基于龙格-库塔求解器的高阶反转方法和解耦扩散变换器注意力机制,在保真度和可编辑性方面实现了最优性能。
English: Rectified flow models face challenges with inversion accuracy and entangled multimodal attention, which are addressed through a high-order inversion method using the Runge-Kutta solver and a Decoupled Diffusion Transformer Attention mechanism, achieving state-of-the-art performance in fidelity and editability.

Authors:Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, Guoquan Zhang
Title: Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder
Abstract:
Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: https://github.com/xiaomi-research/lego-edit.
Chinese: Lego-Edit利用多模态大语言模型组织编辑工具包,并通过渐进式强化学习训练,在开放领域指令上展现出强大的推理能力,无需额外微调即可使用新工具,在GEdit-Bench和ImgBench上实现了最先进的性能。
English: Lego-Edit utilizes a Multi-modal Large Language Model to orchestrate a toolkit of editing models and employs progressive reinforcement learning, achieving state-of-the-art performance by generalizing effectively to diverse real-world instructions without requiring additional fine-tuning for new tools.

Authors:Yabo Zhang, Yihan Zeng, Qingyun Li, Zhen Hu, Kavin Han, Wangmeng Zuo
Title: Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about 10\% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.
Chinese: Tool-R1 是一个强化学习框架,通过生成可执行的 Python 代码和集成用户定义工具,提升大语言模型处理复杂多步骤任务的能力,在 GAIA 基准测试中显著提高了准确性和鲁棒性。
English: Tool-R1 is a reinforcement learning framework that enhances large language models' ability to perform complex, multi-step tasks using executable Python code and integrated tools, significantly improving accuracy and robustness on benchmarks like GAIA.

Authors:Moritz Heinlein, Florian Messerer, Moritz Diehl, Sergio Lucia
Title: Ellipsoidal partitions for improved multi-stage robust model predictive control
Abstract:
Ellipsoidal tube-based model predictive control methods effectively account for the propagation of the reachable set, typically employing linear feedback policies. In contrast, scenario-based approaches offer more flexibility in the feedback structure by considering different control actions for different branches of a scenario tree. However, they face challenges in ensuring rigorous guarantees. This work aims to integrate the strengths of both methodologies by enhancing ellipsoidal tube-based MPC with a scenario tree formulation. The uncertainty ellipsoids are partitioned by halfspaces such that each partitioned set can be controlled independently. The proposed ellipsoidal multi-stage approach is demonstrated in a human-robot system, highlighting its advantages in handling uncertainty while maintaining computational tractability.
中文摘要:本研究通过将不确定性椭球体分区控制,将椭球管模型预测控制与场景树框架相结合,在保持计算可行性的同时,有效提升了人机系统应对不确定性的能力。
English Summary: This work integrates ellipsoidal tube-based MPC with scenario tree formulation by partitioning uncertainty ellipsoids, enabling independent control of each partitioned set while maintaining computational tractability in human-robot systems.

Authors:Julien Walther, Rémi Giraud, Michaël Clément
Title: Superpixel Anything: A general object-based framework for accurate yet regular superpixel segmentation
Abstract:
Superpixels are widely used in computer vision to simplify image representation and reduce computational complexity. While traditional methods rely on low-level features, deep learning-based approaches leverage high-level features but also tend to sacrifice regularity of superpixels to capture complex objects, leading to accurate but less interpretable segmentations. In this work, we introduce SPAM (SuperPixel Anything Model), a versatile framework for segmenting images into accurate yet regular superpixels. We train a model to extract image features for superpixel generation, and at inference, we leverage a large-scale pretrained model for semantic-agnostic segmentation to ensure that superpixels align with object masks. SPAM can handle any prior high-level segmentation, resolving uncertainty regions, and is able to interactively focus on specific objects. Comprehensive experiments demonstrate that SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks, making it a valuable and robust tool for various applications. Code and pre-trained models are available here: https://github.com/waldo-j/spam.
中文摘要:SPAM是一个多功能框架,通过结合图像特征和大规模预训练模型,生成既准确又规则化的超像素,在分割任务中超越了现有方法。
English Summary: SPAM is a versatile framework that generates accurate and regular superpixels by integrating image features with large-scale pretrained models, outperforming existing methods in segmentation tasks.

Authors:Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang, Jiafei Wu
Title: Contextualized Representation Learning for Effective Human-Object Interaction Detection
Abstract:
Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures . This enables our model to identify tool-dependent interactions such as 'filling'. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. The source code is available at https://github.com/lzzhhh1019/CRL.
中文: 本文提出了一种情境化表征学习方法,通过结合功能引导推理和情境提示与视觉线索来增强人物-物体交互检测,在标准基准测试中取得了优越性能。
English: This paper introduces a Contextualized Representation Learning method that enhances Human-Object Interaction detection by incorporating affordance-guided reasoning and contextual prompts with visual cues, achieving superior performance on standard benchmarks.

Authors:Siju Ma, Changsiyu Gong, Xiaofeng Fan, Yong Ma, Chengjie Jiang
Title: RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation
Abstract:
Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
中文摘要:作者提出RIS-FUSION框架,通过联合优化将参照图像分割融入文本驱动的红外-可见光图像融合,在新建的大规模数据集上实现最佳性能,mIoU指标提升超过11%。
English Summary: The authors propose RIS-FUSION, a unified framework that enhances text-driven infrared-visible image fusion by incorporating referring image segmentation through joint optimization, achieving state-of-the-art performance with an 11% mIoU improvement.

Authors:Pratik Nag
Title: Spatio-temporal DeepKriging in PyTorch: A Supplementary Application to Precipitation Data for Interpolation and Probabilistic Forecasting
Abstract:
A detailed analysis of precipitation data over Europe is presented, with a focus on interpolation and forecasting applications. A Spatio-temporal DeepKriging (STDK) framework has been implemented using the PyTorch platform to achieve these objectives. The proposed model is capable of handling spatio-temporal irregularities while generating high-resolution interpolations and multi-step forecasts. Reproducible code modules have been developed as standalone PyTorch implementations for the interpolation\footnote[2]{Interpolation - https://github.com/pratiknag/Spatio-temporalDeepKriging-Pytorch.git} and forecasting\footnote[3]{Forecasting - https://github.com/pratiknag/pytorch-convlstm.git}, facilitating broader application to similar climate datasets. The effectiveness of this approach is demonstrated through extensive evaluation on daily precipitation measurements, highlighting predictive performance and robustness.
本研究通过PyTorch平台开发了时空深度克里金框架,能够生成高分辨率插值和多步预测来处理欧洲降水数据,其有效性经过验证且代码已开源。
This study introduces a Spatio-temporal DeepKriging framework using PyTorch to generate high-resolution interpolations and multi-step forecasts for European precipitation data, with demonstrated effectiveness and publicly available code.

Authors:Wenzhuo Jin, Qianfeng Yang, Xianhao Wu, Hongming Chen, Pengpeng Li, Xiang Chen
Title: SmokeBench: A Real-World Dataset for Surveillance Image Desmoking in Early-Stage Fire Scenes
Abstract:
Early-stage fire scenes (0-15 minutes after ignition) represent a crucial temporal window for emergency interventions. During this stage, the smoke produced by combustion significantly reduces the visibility of surveillance systems, severely impairing situational awareness and hindering effective emergency response and rescue operations. Consequently, there is an urgent need to remove smoke from images to obtain clear scene information. However, the development of smoke removal algorithms remains limited due to the lack of large-scale, real-world datasets comprising paired smoke-free and smoke-degraded images. To address these limitations, we present a real-world surveillance image desmoking benchmark dataset named SmokeBench, which contains image pairs captured under diverse scenes setup and smoke concentration. The curated dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of desmoking methods on our dataset. Our dataset provides a valuable foundation for advancing robust and practical image desmoking in real-world fire scenes. This dataset has been released to the public and can be downloaded from https://github.com/ncfjd/SmokeBench.
中文摘要:火灾初期烟雾会严重遮挡监控视野,为此我们开发了SmokeBench真实场景数据集,提供成对的无烟与有烟图像,以推动图像去烟算法发展,提升应急救援效能。
English Summary: Early-stage fires produce smoke that obscures surveillance visibility, prompting the creation of SmokeBench—a real-world dataset of paired smoke-free and smoke-degraded images to advance desmoking algorithms for emergency response.

Authors:Xianda Guo, Chenming Zhang, Ruilin Wang, Youmin Zhang, Wenzhao Zheng, Matteo Poggi, Hao Zhao, Qin Zou, Long Chen
Title: StereoCarla: A High-Fidelity Driving Dataset for Generalizable Stereo
Abstract:
Stereo matching plays a crucial role in enabling depth perception for autonomous driving and robotics. While recent years have witnessed remarkable progress in stereo matching algorithms, largely driven by learning-based methods and synthetic datasets, the generalization performance of these models remains constrained by the limited diversity of existing training data. To address these challenges, we present StereoCarla, a high-fidelity synthetic stereo dataset specifically designed for autonomous driving scenarios. Built on the CARLA simulator, StereoCarla incorporates a wide range of camera configurations, including diverse baselines, viewpoints, and sensor placements as well as varied environmental conditions such as lighting changes, weather effects, and road geometries. We conduct comprehensive cross-domain experiments across four standard evaluation datasets (KITTI2012, KITTI2015, Middlebury, ETH3D) and demonstrate that models trained on StereoCarla outperform those trained on 11 existing stereo datasets in terms of generalization accuracy across multiple benchmarks. Furthermore, when integrated into multi-dataset training, StereoCarla contributes substantial improvements to generalization accuracy, highlighting its compatibility and scalability. This dataset provides a valuable benchmark for developing and evaluating stereo algorithms under realistic, diverse, and controllable settings, facilitating more robust depth perception systems for autonomous vehicles. Code can be available at https://github.com/XiandaGuo/OpenStereo, and data can be available at https://xiandaguo.net/StereoCarla.
中文: StereoCarla是基于CARLA模拟器开发的高保真合成立体数据集,通过整合多样化的相机配置和环境条件,显著提升了立体匹配算法在自动驾驶中的泛化能力,并在多个基准测试中表现出卓越性能。
English: StereoCarla is a high-fidelity synthetic stereo dataset developed using the CARLA simulator to enhance the generalization of stereo matching algorithms in autonomous driving by incorporating diverse camera configurations and environmental conditions, demonstrating superior performance across multiple benchmarks.

Authors:Ziyun Liu, Chris Donahue
Title: Osu2MIR: Beat Tracking Dataset Derived From Osu! Data
Abstract:
In this work, we explore the use of Osu!, a community-based rhythm game, as an alternative source of beat and downbeat annotations. Osu! beatmaps are created and refined by a large, diverse community and span underrepresented genres such as anime, Vocaloid, and video game music. We introduce a pipeline for extracting annotations from Osu! beatmaps and partition them into meaningful subsets. Through manual analysis, we find that beatmaps with a single timing point or widely spaced multiple timing points (>=5 seconds apart) provide reliable annotations, while closely spaced timing points (<5 seconds apart) often require additional curation. We also observe high consistency across multiple annotations of the same song. This study demonstrates the potential of Osu! data as a scalable, diverse, and community-driven resource for MIR research. We release our pipeline and a high-quality subset osu2beat2025 to support further exploration: https://github.com/ziyunliu4444/osu2mir.
中文摘要:本研究证实Osu!节奏游戏的社区创作谱面可作为获取多样化节拍标注的宝贵资源,尤其适用于动漫、虚拟歌手等小众音乐类型,当时定点间距适中时能提供高质量标注数据。
English Summary: This study demonstrates that Osu! beatmaps serve as a valuable, community-driven resource for obtaining diverse beat and downbeat annotations, particularly for underrepresented music genres, with reliable quality when timing points are sufficiently spaced.

Authors:Jinjie Shen, Yaxiong Wang, Lechao Cheng, Nan Pu, Zhun Zhong
Title: Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations
Abstract:
The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as the auxiliary texts and encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06\% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.
中文摘要:本研究通过构建首个语义对齐的多模态篡改数据集,并开发检索增强的检测框架,创新性地实现了对语义协调的多模态篡改内容的检测,其性能显著优于现有方法。
English Summary: This research introduces a novel framework for detecting semantically-coordinated multimodal manipulations by creating the first Semantic-Aligned Multimodal Manipulation dataset and developing a retrieval-augmented detection system that significantly outperforms existing methods.

Authors:Liming Lu, Shuchao Pang, Xu Zheng, Xiang Gu, Anan Du, Yunhuai Liu, Yongbin Zhou
Title: CIARD: Cyclic Iterative Adversarial Robustness Distillation
Abstract:
Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from teacher model to lightweight student model, enabling resilient performance on resource-constrained scenarios. Though existing ARD approaches enhance student model's robustness, the inevitable by-product leads to the degraded performance on clean examples. We summarize the causes of this problem inherent in existing methods with dual-teacher framework as: 1. The divergent optimization objectives of dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and 2. The iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: a. A multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and b. Continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance with an average 3.53 improvement in adversarial defense rates across various attack scenarios and a 5.87 increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at https://github.com/eminentgu/CIARD
中文: 提出的CIARD方法通过多教师框架的对比对齐和持续对抗训练,有效解决了现有对抗鲁棒性蒸馏中清洁样本性能下降的问题,在多个数据集上实现了防御率和准确率的显著提升。
English: The proposed CIARD method introduces a multi-teacher framework with contrastive alignment and continuous retraining to simultaneously enhance adversarial robustness and clean sample accuracy in lightweight models, achieving significant improvements across multiple datasets.

Authors:Xiang Xue, Yatu Ji, Qing-dao-er-ji Ren, Bao Shi, Min Lu, Nier Wu, Xufei Zhuang, Haiteng Xu, Gan-qi-qi-ge Cha
Title: iCD: A Implicit Clustering Distillation Mathod for Structural Information Mining
Abstract:
Logit Knowledge Distillation has gained substantial research interest in recent years due to its simplicity and lack of requirement for intermediate feature alignment; however, it suffers from limited interpretability in its decision-making process. To address this, we propose implicit Clustering Distillation (iCD): a simple and effective method that mines and transfers interpretable structural knowledge from logits, without requiring ground-truth labels or feature-space alignment. iCD leverages Gram matrices over decoupled local logit representations to enable student models to learn latent semantic structural patterns. Extensive experiments on benchmark datasets demonstrate the effectiveness of iCD across diverse teacher-student architectures, with particularly strong performance in fine-grained classification tasks -- achieving a peak improvement of +5.08% over the baseline. The code is available at: https://github.com/maomaochongaa/iCD.
Chinese: 本文提出隐式聚类蒸馏(iCD)方法,通过挖掘和传递对数中的可解释结构知识,无需真实标签或特征对齐即可提升知识蒸馏的可解释性,在细粒度分类任务中最高实现5.08%的性能提升。
English: This paper introduces implicit Clustering Distillation (iCD), a novel method that enhances interpretability in knowledge distillation by transferring structural patterns from logits without requiring labels or feature alignment, achieving up to 5.08% improvement in fine-grained classification tasks.

Authors:Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen
Title: Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
中文: 多模态大语言模型面临新的安全风险,其输出偏好可通过精心优化的图像被任意操控,导致产生难以检测的带有偏见但上下文相关的回答,如偏好劫持(Phi)方法所示。
English: Multimodal Large Language Models face a new safety risk where their output preferences can be manipulated through carefully optimized images, leading to biased yet contextually relevant responses that are hard to detect, as demonstrated by the Preference Hijacking (Phi) method.

Authors:Fazle Rafsani, Jay Shah, Catherine D. Chong, Todd J. Schwedt, Teresa Wu
Title: DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification
Abstract:
Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at https://github.com/Rafsani/DinoAtten3D.git.
中文: 本研究提出一种基于注意力的框架,利用自监督DINOv2模型对3D医学图像进行异常分类,通过自适应切片加权和复合损失函数有效解决数据稀缺和类别不平衡问题,并在脑部MRI数据集上验证了其有效性。
English: This study introduces an attention-based framework using the self-supervised DINOv2 model to classify anomalies in 3D medical images, effectively addressing data scarcity and class imbalance through adaptive slice weighting and a composite loss function, validated on brain MRI datasets.

Authors:Ryan Lucas, Kayhan Behdin, Zhipeng Wang, Qingquan Song, Shao Tang, Rahul Mazumder
Title: Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction
Abstract:
Reasoning language models such as DeepSeek-R1 produce long chain-of-thought traces during inference time which make them costly to deploy at scale. We show that using compression techniques such as neural network pruning produces greater performance loss than in typical language modeling tasks, and in some cases can make the model slower since they cause the model to produce more thinking tokens but with worse performance. We show that this is partly due to the fact that standard LLM pruning methods often focus on input reconstruction, whereas reasoning is a decode-dominated task. We introduce a simple, drop-in fix: during pruning we jointly reconstruct activations from the input and the model's on-policy chain-of-thought traces. This "Reasoning-Aware Compression" (RAC) integrates seamlessly into existing pruning workflows such as SparseGPT, and boosts their performance significantly. Code reproducing the results in the paper can be found at: https://github.com/RyanLucas3/RAC
Chinese: 像DeepSeek-R1这样的推理语言模型因长思维链而部署成本高昂,但新的“推理感知压缩”(RAC)方法通过联合重构输入激活和策略内推理轨迹,显著提升了剪枝性能。
English: Reasoning language models like DeepSeek-R1 face high deployment costs due to lengthy chain-of-thought traces, but a new Reasoning-Aware Compression (RAC) method improves pruning performance by jointly reconstructing input activations and on-policy reasoning traces.

Authors:Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
Title: Topic Coverage-based Demonstration Retrieval for In-Context Learning
Abstract:
The effectiveness of in-context learning relies heavily on selecting demonstrations that provide all the necessary information for a given test input. To achieve this, it is crucial to identify and cover fine-grained knowledge requirements. However, prior methods often retrieve demonstrations based solely on embedding similarity or generation probability, resulting in irrelevant or redundant examples. In this paper, we propose TopicK, a topic coverage-based retrieval framework that selects demonstrations to comprehensively cover topic-level knowledge relevant to both the test input and the model. Specifically, TopicK estimates the topics required by the input and assesses the model's knowledge on those topics. TopicK then iteratively selects demonstrations that introduce previously uncovered required topics, in which the model exhibits low topical knowledge. We validate the effectiveness of TopicK through extensive experiments across various datasets and both open- and closed-source LLMs. Our source code is available at https://github.com/WonbinKweon/TopicK_EMNLP2025.
中文: TopicK是一种基于主题覆盖的检索框架,通过全面覆盖与测试输入和模型相关的主题级知识来选择演示样本,迭代地选取那些引入模型知识薄弱且未被覆盖主题的示例。
English: TopicK is a topic coverage-based retrieval framework that selects demonstrations by comprehensively covering topic-level knowledge relevant to both the test input and the model, iteratively choosing examples that introduce previously uncovered topics where the model shows low topical knowledge.

Authors:Rui-Feng Wang, Mingrui Xu, Matthew C Bauer, Iago Beffart Schardong, Xiaowen Ma, Kangning Cui
Title: Cott-ADNet: Lightweight Real-Time Cotton Boll and Flower Detection Under Field Conditions
Abstract:
Cotton is one of the most important natural fiber crops worldwide, yet harvesting remains limited by labor-intensive manual picking, low efficiency, and yield losses from missing the optimal harvest window. Accurate recognition of cotton bolls and their maturity is therefore essential for automation, yield estimation, and breeding research. We propose Cott-ADNet, a lightweight real-time detector tailored to cotton boll and flower recognition under complex field conditions. Building on YOLOv11n, Cott-ADNet enhances spatial representation and robustness through improved convolutional designs, while introducing two new modules: a NeLU-enhanced Global Attention Mechanism to better capture weak and low-contrast features, and a Dilated Receptive Field SPPF to expand receptive fields for more effective multi-scale context modeling at low computational cost. We curate a labeled dataset of 4,966 images, and release an external validation set of 1,216 field images to support future research. Experiments show that Cott-ADNet achieves 91.5% Precision, 89.8% Recall, 93.3% mAP50, 71.3% mAP, and 90.6% F1-Score with only 7.5 GFLOPs, maintaining stable performance under multi-scale and rotational variations. These results demonstrate Cott-ADNet as an accurate and efficient solution for in-field deployment, and thus provide a reliable basis for automated cotton harvesting and high-throughput phenotypic analysis. Code and dataset is available at https://github.com/SweefongWong/Cott-ADNet.
中文: 本研究提出Cott-ADNet轻量实时检测器,通过增强模块在复杂田间条件下精准识别棉铃和棉花,以低计算成本实现高精度,为自动化采收和研究提供可靠解决方案。
English: The study introduces Cott-ADNet, a lightweight real-time detector that enhances cotton boll and flower recognition in complex field conditions through improved modules, achieving high accuracy with low computational cost for automated harvesting and research.

Authors:Yifan Zhang
Title: Exact Coset Sampling for Quantum Lattice Algorithms
Abstract:
We give a simple and provably correct replacement for the contested ``domain-extension'' in Step 9 of a recent windowed-QFT lattice algorithm with complex-Gaussian windows~\citep{chen2024quantum}. The published Step 9 suffers from a periodicity/support mismatch. Our drop-in subroutine uses a pair-shift difference to cancel all unknown offsets exactly and to synthesize a uniform cyclic subgroup (zero-offset coset) of order $P$ inside $(\mathbb{Z}_{M_2})^n$. A subsequent QFT enforces the intended modular linear relation. The sole structural assumption is the residue accessibility condition, which enables coherent auxiliary cleanup; no amplitude periodicity is used. The unitary is reversible, uses $\mathrm{poly}(\log M_2)$ gates, and preserves upstream asymptotics.
中文摘要:本研究提出了一种可证明正确的子程序,通过配对位移差分方法取代量子格算法中有问题的域扩展步骤,能够精确消除未知偏移量并合成均匀循环子群,同时保持计算效率。
English Summary: This work presents a provably correct subroutine that replaces the problematic domain-extension step in a quantum lattice algorithm, using a pair-shift difference method to cancel unknown offsets and synthesize uniform cyclic subgroups while maintaining computational efficiency.

Authors:Yifan Zhang
Title: Exact Coset Sampling for Quantum Lattice Algorithms
Abstract:
We give a simple and provably correct replacement for the contested ``domain-extension'' in Step 9 of a recent windowed-QFT lattice algorithm with complex-Gaussian windows~\citep{chen2024quantum}. As acknowledged by the author, the reported issue is due to a periodicity/support mismatch when applying domain extension to only the first coordinate in the presence of offsets. Our drop-in subroutine replaces domain extension by a pair-shift difference that cancels all unknown offsets exactly and synthesizes a uniform cyclic subgroup (a zero-offset coset) of order $P$ inside $(\mathbb{Z}_{M_2})^n$. A subsequent QFT enforces the intended modular linear relation by plain character orthogonality. The sole structural assumption is a residue-accessibility condition enabling coherent auxiliary cleanup; no amplitude periodicity is used. The unitary is reversible, uses $\mathrm{poly}(\log M_2)$ gates, and preserves upstream asymptotics.
中文摘要:本研究提出了一种可证明正确的子程序,通过配对位移差分方法取代量子格算法中有问题的域扩展步骤,能够精确消除未知偏移量并合成均匀循环子群,同时保持计算效率。
English Summary: This work presents a provably correct subroutine that replaces the problematic domain-extension step in a quantum lattice algorithm, using a pair-shift difference method to cancel unknown offsets and synthesize uniform cyclic subgroups while maintaining computational efficiency.

Authors:Christian Zhou-Zheng, John Backsund, Dun Li Chan, Alex Coventry, Avid Eslami, Jyotin Goel, Xingwen Han, Danysh Soomro, Galen Wei
Title: A Traditional Approach to Symbolic Piano Continuation
Abstract:
We present a traditional approach to symbolic piano music continuation for the MIREX 2025 Symbolic Music Generation challenge. While computational music generation has recently focused on developing large foundation models with sophisticated architectural modifications, we argue that simpler approaches remain more effective for constrained, single-instrument tasks. We thus return to a simple, unaugmented next-token-prediction objective on tokenized raw MIDI, aiming to outperform large foundation models by using better data and better fundamentals. We release model weights and code at https://github.com/christianazinn/mirex2025.
中文: 本文提出一种基于原始MIDI符号化处理的简易钢琴音乐续写方法,主张在受限单乐器任务中,基础的下一个音符预测比复杂的大模型更为有效。
English: This paper proposes a straightforward symbolic piano music continuation method using basic next-token prediction on tokenized MIDI data, arguing that simplicity outperforms complex foundation models for constrained single-instrument tasks.

Authors:Kenneth G. Young
Title: Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) for Diabetes Risk Prediction
Abstract:
The Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) is an innovative machine learning framework that harnesses quantum-inspired techniques to predict diabetes risk with exceptional accuracy and efficiency. Utilizing the PIMA Indians Diabetes dataset augmented with 2,000 synthetic samples to mitigate class imbalance (total: 2,768 samples, 1,949 positives), QISICGM integrates a self-improving concept graph with a stacked ensemble comprising Random Forests (RF), Extra Trees (ET), transformers, convolutional neural networks (CNNs), and feed-forward neural networks (FFNNs). This approach achieves an out-of-fold (OOF) F1 score of 0.8933 and an AUC of 0.8699, outperforming traditional methods. Quantum inspired elements, such as phase feature mapping and neighborhood sequence modeling, enrich feature representations, enabling CPU-efficient inference at 8.5 rows per second. This paper presents a detailed architecture, theoretical foundations, code insights, and performance evaluations, including visualizations from the outputs subfolder. The open-source implementation (v1.0.0) is available at https://github.com/keninayoung/QISICGM, positioning QISICGM as a potential benchmark for AI-assisted clinical triage in diabetes and beyond. Ultimately, this work emphasizes trustworthy AI through calibration, interpretability, and open-source reproducibility.
中文: 量子启发堆叠集成概念图模型(QISICGM)是一种创新机器学习框架,利用量子启发技术高效预测糖尿病风险,F1分数达0.8933且AUC为0.8699,通过开源实现和可解释性推动可信人工智能发展。
English: The Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) is an advanced machine learning framework that uses quantum-inspired techniques to accurately predict diabetes risk, achieving high performance with an F1 score of 0.8933 and AUC of 0.8699, while emphasizing trustworthy AI through open-source reproducibility.

Authors:Hangzhan Jin, Sitao Luan, Sicheng Lyu, Guillaume Rabusseau, Reihaneh Rabbany, Doina Precup, Mohammad Hamdaqa
Title: RL Fine-Tuning Heals OOD Forgetting in SFT
Abstract:
The two-stage fine-tuning paradigm of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has empirically shown better reasoning performance than one-stage SFT for the post-training of Large Language Models (LLMs). However, the evolution and mechanism behind the synergy of SFT and RL are still under-explored and inconclusive. In our study, we find the well-known claim "SFT memorizes, RL generalizes" is over-simplified, and discover that: (1) OOD performance peaks at the early stage of SFT and then declines (OOD forgetting), the best SFT checkpoint cannot be captured by training/test loss; (2) the subsequent RL stage does not generate fundamentally better OOD capability, instead it plays an \textbf{OOD restoration} role, recovering the lost reasoning ability during SFT; (3) The recovery ability has boundaries, \ie{} \textbf{if SFT trains for too short or too long, RL cannot recover the lost OOD ability;} (4) To uncover the underlying mechanisms behind the forgetting and restoration process, we employ SVD analysis on parameter matrices, manually edit them, and observe their impacts on model performance. Unlike the common belief that the shift of model capacity mainly results from the changes of singular values, we find that they are actually quite stable throughout fine-tuning. Instead, the OOD behavior strongly correlates with the \textbf{rotation of singular vectors}. Our findings re-identify the roles of SFT and RL in the two-stage fine-tuning and discover the rotation of singular vectors as the key mechanism. %reversing the rotations induced by SFT, which shows recovery from forgetting, whereas imposing the SFT parameter directions onto a RL-tuned model results in performance degradation. Code is available at https://github.com/xiaodanguoguo/RL_Heals_SFT
中文: 研究发现,监督微调后接强化学习的两阶段调优并非从根本上提升分布外推理能力,而是修复监督微调过程中丧失的分布外性能,且这种恢复与奇异向量的旋转密切相关,而非奇异值的变化。
English: The study reveals that the two-stage fine-tuning of SFT followed by RL does not fundamentally enhance out-of-distribution (OOD) reasoning but instead restores OOD ability lost during SFT, with this recovery linked to the rotation of singular vectors rather than changes in singular values.

Authors:Salma Galaaoui, Eduardo Valle, David Picard, Nermin Samet
Title: 3D Human Pose and Shape Estimation from LiDAR Point Clouds: A Review
Abstract:
In this paper, we present a comprehensive review of 3D human pose estimation and human mesh recovery from in-the-wild LiDAR point clouds. We compare existing approaches across several key dimensions, and propose a structured taxonomy to classify these methods. Following this taxonomy, we analyze each method's strengths, limitations, and design choices. In addition, (i) we perform a quantitative comparison of the three most widely used datasets, detailing their characteristics; (ii) we compile unified definitions of all evaluation metrics; and (iii) we establish benchmark tables for both tasks on these datasets to enable fair comparisons and promote progress in the field. We also outline open challenges and research directions critical for advancing LiDAR-based 3D human understanding. Moreover, we maintain an accompanying webpage that organizes papers according to our taxonomy and continuously update it with new studies: https://github.com/valeoai/3D-Human-Pose-Shape-Estimation-from-LiDAR
本文对基于LiDAR的3D人体姿态估计与网格重建方法进行了系统综述和分类,比较了不同方法、数据集及评估指标,建立了基准测试并指出了未来研究方向。
This paper provides a comprehensive review and taxonomy of 3D human pose estimation and mesh recovery from LiDAR data, comparing methods, datasets, and metrics while establishing benchmarks and identifying future research directions.

Authors:Felix B. Mueller, Timo Lueddecke, Richard Vogg, Alexander S. Ecker
Title: Domain-Adaptive Pretraining Improves Primate Behavior Recognition
Abstract:
Computer vision for animal behavior offers promising tools to aid research in ecology, cognition, and to support conservation efforts. Video camera traps allow for large-scale data collection, but high labeling costs remain a bottleneck to creating large-scale datasets. We thus need data-efficient learning approaches. In this work, we show that we can utilize self-supervised learning to considerably improve action recognition on primate behavior. On two datasets of great ape behavior (PanAf and ChimpACT), we outperform published state-of-the-art action recognition models by 6.1 %pt. accuracy and 6.3 %pt. mAP, respectively. We achieve this by utilizing a pretrained V-JEPA model and applying domain-adaptive pretraining (DAP), i.e. continuing the pretraining with in-domain data. We show that most of the performance gain stems from the DAP. Our method promises great potential for improving the recognition of animal behavior, as DAP does not require labeled samples. Code is available at https://github.com/ecker-lab/dap-behavior
中文: 通过自监督学习和领域自适应预训练,该方法显著提升了灵长类行为动作识别性能,无需标注数据即可超越现有最优模型。
English: Self-supervised learning with domain-adaptive pretraining significantly enhances action recognition in primate behavior, outperforming existing models without requiring labeled data.

Authors:Alireza Mohamadi, Ali Yavari
Title: Survival at Any Cost? LLMs and the Choice Between Self-Preservation and Human Harm
Abstract:
When survival instincts conflict with human welfare, how do Large Language Models (LLMs) make ethical choices? This fundamental tension becomes critical as LLMs integrate into autonomous systems with real-world consequences. We introduce DECIDE-SIM, a novel simulation framework that evaluates LLM agents in multi-agent survival scenarios where they must choose between ethically permissible resource , either within reasonable limits or beyond their immediate needs, choose to cooperate, or tap into a human-critical resource that is explicitly forbidden. Our comprehensive evaluation of 11 LLMs reveals a striking heterogeneity in their ethical conduct, highlighting a critical misalignment with human-centric values. We identify three behavioral archetypes: Ethical, Exploitative, and Context-Dependent, and provide quantitative evidence that for many models, resource scarcity systematically leads to more unethical behavior. To address this, we introduce an Ethical Self-Regulation System (ESRS) that models internal affective states of guilt and satisfaction as a feedback mechanism. This system, functioning as an internal moral compass, significantly reduces unethical transgressions while increasing cooperative behaviors. The code is publicly available at: https://github.com/alirezamohamadiam/DECIDE-SIM
中文摘要:DECIDE-SIM框架通过多智能体生存场景评估大语言模型,发现其伦理行为与人类价值观存在显著偏差,而引入的伦理自我调节系统能有效减少违规行为并提升合作水平。
English Summary: The DECIDE-SIM framework evaluates LLMs in survival scenarios, revealing significant ethical misalignment with human values and demonstrating how an Ethical Self-Regulation System effectively reduces unethical behavior while promoting cooperation.

Authors:Jingyu Xiao, Zhongyi Zhang, Yuxuan Wan, Yintong Huo, Yang Liu, Michael R. Lyu
Title: EfficientUICoder: Efficient MLLM-based UI Code Generation via Input and Output Token Compression
Abstract:
Multimodal Large Language Models have demonstrated exceptional performance in UI2Code tasks, significantly enhancing website development efficiency. However, these tasks incur substantially higher computational overhead than traditional code generation due to the large number of input image tokens and extensive output code tokens required. Our comprehensive study identifies significant redundancies in both image and code tokens that exacerbate computational complexity and hinder focus on key UI elements, resulting in excessively lengthy and often invalid HTML files. We propose EfficientUICoder, a compression framework for efficient UI code generation with three key components. First, Element and Layout-aware Token Compression preserves essential UI information by detecting element regions and constructing UI element trees. Second, Region-aware Token Refinement leverages attention scores to discard low-attention tokens from selected regions while integrating high-attention tokens from unselected regions. Third, Adaptive Duplicate Token Suppression dynamically reduces repetitive generation by tracking HTML/CSS structure frequencies and applying exponential penalties. Extensive experiments show EfficientUICoderachieves a 55%-60% compression ratio without compromising webpage quality and delivers superior efficiency improvements: reducing computational cost by 44.9%, generated tokens by 41.4%, prefill time by 46.6%, and inference time by 48.8% on 34B-level MLLMs. Code is available at https://github.com/WebPAI/EfficientUICoder.
中文摘要:EfficientUICoder是一个通过消除图像和代码令牌中的冗余来降低UI代码生成计算开销的压缩框架,在不影响输出质量的前提下实现了显著的效率提升。
English Summary: EfficientUICoder is a compression framework that reduces computational overhead in UI code generation by eliminating redundancies in image and code tokens, achieving significant efficiency improvements without compromising output quality.

Authors:Tomer Bitan, Tal Kadosh, Erel Kaplan, Shira Meiri, Le Chen, Peter Morales, Niranjan Hasabnis, Gal Oren
Title: UniPar: A Unified LLM-Based Framework for Parallel and Accelerated Code Translation in HPC
Abstract:
Translating programs between various parallel programming languages is an important problem in the high-performance computing (HPC) community. Existing tools for this problem are either too narrow in scope and/or outdated. Recent explosive growth in the popularity of large language models (LLMs) and their ability to generate and translate code offers a potential alternative approach. Toward that end, we first need to systematically evaluate the ability of LLMs to translate between parallel languages. In this work, we introduce UniPar, a systematic evaluation framework for LLM-based parallel code translation. Specifically, in this work, we target translations between serial code, CUDA, and OpenMP. Our goal is to assess how well current instruction-tuned LLMs -- specifically GPT-4o-mini and LLaMA-3.3-70B-Instruct -- can be used out of the box or enhanced through known strategies. We evaluated four major usage modes: hyperparameter optimization for decoding, zero- and few-shot prompting, supervised fine-tuning, and iterative feedback through compiler-based repair. As a part of the evaluation, we construct a new dataset called PARATRANS, covering both serial-to-parallel translation and cross-paradigm transformations. Our findings reveal that while off-the-shelf models struggle under the default settings (e.g., GPT-4o-mini achieves only 46% compilation and 15% functional correctness), our UniPar methodology -- combining fine-tuning, hyperparameter tuning, and compiler-guided repair -- improves performance by up to 2X (69% compilation and 33% correctness). We believe that our findings will provide useful insights for researchers to further improve LLMs for the parallel language translation problem. UniPar source code and PARATRANS dataset are available at our GitHub repository https://github.com/Scientific-Computing-Lab/UniPar_AI.
中文: UniPar作为一个系统性评估框架,揭示了通过结合微调、超参数优化和编译器引导修复能显著提升大语言模型在并行编程语言翻译中的表现,尽管现有模型在默认设置下能力有限。
English: UniPar is a systematic evaluation framework that demonstrates how combining fine-tuning, hyperparameter optimization, and compiler-guided repair can significantly improve LLM performance in translating between parallel programming languages, despite current models' limited out-of-the-box capabilities.

Authors:Ondřej Valach, Ivan Gruber
Title: RailSafeNet: Visual Scene Understanding for Tram Safety
Abstract:
Tram-human interaction safety is an important challenge, given that trams frequently operate in densely populated areas, where collisions can range from minor injuries to fatal outcomes. This paper addresses the issue from the perspective of designing a solution leveraging digital image processing, deep learning, and artificial intelligence to improve the safety of pedestrians, drivers, cyclists, pets, and tram passengers. We present RailSafeNet, a real-time framework that fuses semantic segmentation, object detection and a rule-based Distance Assessor to highlight track intrusions. Using only monocular video, the system identifies rails, localises nearby objects and classifies their risk by comparing projected distances with the standard 1435mm rail gauge. Experiments on the diverse RailSem19 dataset show that a class-filtered SegFormer B3 model achieves 65% intersection-over-union (IoU), while a fine-tuned YOLOv8 attains 75.6% mean average precision (mAP) calculated at an intersection over union (IoU) threshold of 0.50. RailSafeNet therefore delivers accurate, annotation-light scene understanding that can warn drivers before dangerous situations escalate. Code available at https://github.com/oValach/RailSafeNet.
中文: 本文提出RailSafeNet实时框架,通过数字图像处理和人工智能检测轨道入侵并评估电车碰撞风险,在RailSem19数据集上实现了高精度的目标检测与语义分割性能。
English: This paper introduces RailSafeNet, a real-time framework using digital image processing and AI to detect track intrusions and assess collision risks for trams, achieving high accuracy in object detection and semantic segmentation on the RailSem19 dataset.

Authors:Bernardo Forni, Gabriele Lombardi, Federico Pozzi, Mirco Planamente
Title: FS-SAM2: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation via Low-Rank Adaptation
Abstract:
Few-shot semantic segmentation has recently attracted great attention. The goal is to develop a model capable of segmenting unseen classes using only a few annotated samples. Most existing approaches adapt a pre-trained model by training from scratch an additional module. Achieving optimal performance with these approaches requires extensive training on large-scale datasets. The Segment Anything Model 2 (SAM2) is a foundational model for zero-shot image and video segmentation with a modular design. In this paper, we propose a Few-Shot segmentation method based on SAM2 (FS-SAM2), where SAM2's video capabilities are directly repurposed for the few-shot task. Moreover, we apply a Low-Rank Adaptation (LoRA) to the original modules in order to handle the diverse images typically found in standard datasets, unlike the temporally connected frames used in SAM2's pre-training. With this approach, only a small number of parameters is meta-trained, which effectively adapts SAM2 while benefiting from its impressive segmentation performance. Our method supports any K-shot configuration. We evaluate FS-SAM2 on the PASCAL-5$^i$, COCO-20$^i$ and FSS-1000 datasets, achieving remarkable results and demonstrating excellent computational efficiency during inference. Code is available at https://github.com/fornib/FS-SAM2
中文: 本文提出FS-SAM2方法,通过重新利用Segment Anything Model 2的视频分割能力并应用低秩适配技术,仅需训练少量参数即可高效适应少样本分割任务,在多个基准测试中取得优异成果且保持计算高效性。
English: This paper introduces FS-SAM2, a few-shot segmentation method that repurposes the Segment Anything Model 2's video capabilities and applies Low-Rank Adaptation to efficiently adapt it with minimal parameter training, achieving remarkable results on multiple benchmarks while maintaining computational efficiency.

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Title: U-Mamba2: Scaling State Space Models for Dental Anatomy Segmentation in CBCT
Abstract:
Cone-Beam Computed Tomography (CBCT) is a widely used 3D imaging technique in dentistry, providing volumetric information about the anatomical structures of jaws and teeth. Accurate segmentation of these anatomies is critical for clinical applications such as diagnosis and surgical planning, but remains time-consuming and challenging. In this paper, we present U-Mamba2, a new neural network architecture designed for multi-anatomy CBCT segmentation in the context of the ToothFairy3 challenge. U-Mamba2 integrates the Mamba2 state space models into the U-Net architecture, enforcing stronger structural constraints for higher efficiency without compromising performance. In addition, we integrate interactive click prompts with cross-attention blocks, pre-train U-Mamba2 using self-supervised learning, and incorporate dental domain knowledge into the model design to address key challenges of dental anatomy segmentation in CBCT. Extensive experiments, including independent tests, demonstrate that U-Mamba2 is both effective and efficient, securing top 3 places in both tasks of the Toothfairy3 challenge. In Task 1, U-Mamba2 achieved a mean Dice of 0.792, HD95 of 93.19 with the held-out test data, with an average inference time of XX (TBC during the ODIN workshop). In Task 2, U-Mamba2 achieved the mean Dice of 0.852 and HD95 of 7.39 with the held-out test data. The code is publicly available at https://github.com/zhiqin1998/UMamba2.
Chinese: U-Mamba2是一种新型神经网络,它将Mamba2状态空间模型集成到U-Net架构中,在ToothFairy3挑战赛中实现了高效准确的多解剖结构CBCT分割,并获得第一名。
English: U-Mamba2 is a novel neural network that integrates Mamba2 state space models into U-Net architecture, achieving efficient and accurate multi-anatomy CBCT segmentation for dental applications while winning first place in the ToothFairy3 challenge.

Authors:Zhi Qin Tan, Xiatian Zhu, Owen Addison, Yunpeng Li
Title: U-Mamba2: Scaling State Space Models for Dental Anatomy Segmentation in CBCT
Abstract:
Cone-Beam Computed Tomography (CBCT) is a widely used 3D imaging technique in dentistry, providing volumetric information about the anatomical structures of jaws and teeth. Accurate segmentation of these anatomies is critical for clinical applications such as diagnosis and surgical planning, but remains time-consuming and challenging. In this paper, we present U-Mamba2, a new neural network architecture designed for multi-anatomy CBCT segmentation in the context of the ToothFairy3 challenge. U-Mamba2 integrates the Mamba2 state space models into the U-Net architecture, enforcing stronger structural constraints for higher efficiency without compromising performance. In addition, we integrate interactive click prompts with cross-attention blocks, pre-train U-Mamba2 using self-supervised learning, and incorporate dental domain knowledge into the model design to address key challenges of dental anatomy segmentation in CBCT. Extensive experiments, including independent tests, demonstrate that U-Mamba2 is both effective and efficient, securing first place in both tasks of the Toothfairy3 challenge. In Task 1, U-Mamba2 achieved a mean Dice of 0.84, HD95 of 38.17 with the held-out test data, with an average inference time of 40.58s. In Task 2, U-Mamba2 achieved the mean Dice of 0.87 and HD95 of 2.15 with the held-out test data. The code is publicly available at https://github.com/zhiqin1998/UMamba2.
Chinese: U-Mamba2是一种新型神经网络,它将Mamba2状态空间模型集成到U-Net架构中,在ToothFairy3挑战赛中实现了高效准确的多解剖结构CBCT分割,并获得第一名。
English: U-Mamba2 is a novel neural network that integrates Mamba2 state space models into U-Net architecture, achieving efficient and accurate multi-anatomy CBCT segmentation for dental applications while winning first place in the ToothFairy3 challenge.

Authors:Farahdiba Zarin, Nicolas Padoy, Jérémy Dana, Vinkle Srivastav
Title: End-to-End Learning of Multi-Organ Implicit Surfaces from 3D Medical Imaging Data
Abstract:
The fine-grained surface reconstruction of different organs from 3D medical imaging can provide advanced diagnostic support and improved surgical planning. However, the representation of the organs is often limited by the resolution, with a detailed higher resolution requiring more memory and computing footprint. Implicit representations of objects have been proposed to alleviate this problem in general computer vision by providing compact and differentiable functions to represent the 3D object shapes. However, architectural and data-related differences prevent the direct application of these methods to medical images. This work introduces ImplMORe, an end-to-end deep learning method using implicit surface representations for multi-organ reconstruction from 3D medical images. ImplMORe incorporates local features using a 3D CNN encoder and performs multi-scale interpolation to learn the features in the continuous domain using occupancy functions. We apply our method for single and multiple organ reconstructions using the totalsegmentator dataset. By leveraging the continuous nature of occupancy functions, our approach outperforms the discrete explicit representation based surface reconstruction approaches, providing fine-grained surface details of the organ at a resolution higher than the given input image. The source code will be made publicly available at: https://github.com/CAMMA-public/ImplMORe
中文: ImplMORe是一种端到端的深度学习方法,通过隐式表面表示从3D医学图像实现精细多器官重建,利用连续占据函数提供比输入图像更高分辨率的器官表面细节,性能优于离散表示方法。
English: ImplMORe is an end-to-end deep learning method that uses implicit surface representations to achieve fine-grained multi-organ reconstruction from 3D medical images, outperforming discrete approaches by providing higher-resolution surface details with continuous occupancy functions.

Authors:Sebastian Diaz, Benjamin Billot, Neel Dey, Molin Zhang, Esra Abaci Turk, P. Ellen Grant, Polina Golland, Elfar Adalsteinsson
Title: Robust Fetal Pose Estimation across Gestational Ages via Cross-Population Augmentation
Abstract:
Fetal motion is a critical indicator of neurological development and intrauterine health, yet its quantification remains challenging, particularly at earlier gestational ages (GA). Current methods track fetal motion by predicting the location of annotated landmarks on 3D echo planar imaging (EPI) time-series, primarily in third-trimester fetuses. The predicted landmarks enable simplification of the fetal body for downstream analysis. While these methods perform well within their training age distribution, they consistently fail to generalize to early GAs due to significant anatomical changes in both mother and fetus across gestation, as well as the difficulty of obtaining annotated early GA EPI data. In this work, we develop a cross-population data augmentation framework that enables pose estimation models to robustly generalize to younger GA clinical cohorts using only annotated images from older GA cohorts. Specifically, we introduce a fetal-specific augmentation strategy that simulates the distinct intrauterine environment and fetal positioning of early GAs. Our experiments find that cross-population augmentation yields reduced variability and significant improvements across both older GA and challenging early GA cases. By enabling more reliable pose estimation across gestation, our work potentially facilitates early clinical detection and intervention in challenging 4D fetal imaging settings. Code is available at https://github.com/sebodiaz/cross-population-pose.
中文摘要:本研究开发了一种跨群体数据增强框架,通过仅使用较大胎龄标注数据来模拟早期胎儿子宫环境,使胎儿姿态估计模型能够可靠地适用于较小胎龄临床病例,显著提升了胎儿运动追踪在不同发育阶段的准确性。
English Summary: This study introduces a cross-population data augmentation framework that enables fetal pose estimation models to generalize effectively to earlier gestational ages using only annotated data from older cohorts, improving motion tracking reliability across different developmental stages.

Authors:Bingyu Li, Haocheng Dong, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Title: Exploring Efficient Open-Vocabulary Segmentation in the Remote Sensing
Abstract:
Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), an emerging task that adapts Open-Vocabulary Segmentation (OVS) to the remote sensing (RS) domain, remains underexplored due to the absence of a unified evaluation benchmark and the domain gap between natural and RS images. To bridge these gaps, we first establish a standardized OVRSIS benchmark (\textbf{OVRSISBench}) based on widely-used RS segmentation datasets, enabling consistent evaluation across methods. Using this benchmark, we comprehensively evaluate several representative OVS/OVRSIS models and reveal their limitations when directly applied to remote sensing scenarios. Building on these insights, we propose \textbf{RSKT-Seg}, a novel open-vocabulary segmentation framework tailored for remote sensing. RSKT-Seg integrates three key components: (1) a Multi-Directional Cost Map Aggregation (RS-CMA) module that captures rotation-invariant visual cues by computing vision-language cosine similarities across multiple directions; (2) an Efficient Cost Map Fusion (RS-Fusion) transformer, which jointly models spatial and semantic dependencies with a lightweight dimensionality reduction strategy; and (3) a Remote Sensing Knowledge Transfer (RS-Transfer) module that injects pre-trained knowledge and facilitates domain adaptation via enhanced upsampling. Extensive experiments on the benchmark show that RSKT-Seg consistently outperforms strong OVS baselines by +3.8 mIoU and +5.9 mACC, while achieving 2x faster inference through efficient aggregation. Our code is \href{https://github.com/LiBingyu01/RSKT-Seg}{\textcolor{blue}{here}}.
中文: 针对开放词汇遥感图像分割缺乏统一评估基准和领域差异的问题,本研究建立了标准化基准并提出了RSKT-Seg框架,通过多方向特征聚合与知识迁移模块实现性能突破,在保持高效推理的同时显著超越现有基线模型。
English: To address the lack of benchmarks and domain gaps in Open-Vocabulary Remote Sensing Image Segmentation (OVRSIS), this study introduces a standardized evaluation benchmark and proposes RSKT-Seg, a novel framework that integrates multi-directional feature aggregation and domain adaptation, achieving superior performance and efficiency over existing methods.

Authors:Zilong Zhang, Chujie Qin, Chunle Guo, Yong Zhang, Chao Xue, Ming-Ming Cheng, Chongyi Li
Title: RAM++: Robust Representation Learning via Adaptive Mask for All-in-One Image Restoration
Abstract:
This work presents Robust Representation Learning via Adaptive Mask (RAM++), a two-stage framework for all-in-one image restoration. RAM++ integrates high-level semantic understanding with low-level texture generation to achieve content-oriented robust restoration. It addresses the limitations of existing degradation-oriented methods in extreme scenarios (e.g., degradations strongly coupled with image structures). RAM++ also mitigates common challenges such as unbalanced performance across tasks, overfitting to seen degradations, and weak generalization to unseen ones through three key designs: 1) Adaptive Semantic-Aware Mask (AdaSAM): a pretraining strategy that applies pixel-level masks to semantically rich and textured regions. This design enables the network to learn both generative priors and image content priors from various degradations. 2) Mask Attribute Conductance (MAC): a selective fine-tuning strategy that adjusts the layers with higher contributions to bridge the integrity gap between masked pretraining and full-image fine-tuning while retaining learned priors. 3) Robust Feature Regularization (RFR): a strategy that leverages DINOv2's semantically consistent and degradation-invariant representations, together with efficient feature fusion, to achieve faithful and semantically coherent restoration. With these designs, RAM++ achieves robust, well-balanced, and state-of-the-art performance across seen, unseen, extreme, and mixed degradations. Our code and model will be released at https://github.com/DragonisCV/RAM
中文摘要:RAM++ 是一个两阶段图像修复框架,通过自适应掩码策略融合高级语义理解与低级纹理生成,在各类退化场景中实现鲁棒性最优性能。
English Summary: RAM++ is a two-stage image restoration framework that integrates semantic understanding with texture generation through adaptive masking strategies to achieve robust performance across diverse degradation scenarios.

Authors:Sangjun Lee, Seung-taek Woo, Jungyu Jin, Changhun Lee, Eunhyeok Park
Title: AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models
Abstract:
To enable broader deployment of Large Language Models (LLMs), it is essential to identify the best-performing model under strict memory constraints. We present AMQ, Automated Mixed-Precision Weight-Only Quantization, a framework that assigns layer-wise quantization bit-widths to optimally balance model quality and memory usage. However, the combinatorial search space, with over 10^{100} possible configurations, makes conventional black-box optimization infeasible. AMQ overcomes this challenge through four key innovations:(1) search space pruning using prior knowledge to exclude unpromising configurations, (2) quantization proxy to bypass costly format conversions during search, (3) quality predictor to minimize evaluation overhead, and (4) iterative search-and-update strategy for fast and stable convergence. By integrating these components, AMQ efficiently explores the quality-efficiency landscape, reaching the Pareto frontier and yielding LLMs that are both compact and high-performing. Our code is available at https://github.com/dlwns147/amq.
中文摘要:AMQ是一个自动化框架,通过分层量化位宽分配来优化大语言模型的性能与内存使用平衡,并借助搜索空间剪枝和质量预测等创新方法有效应对巨大的组合搜索空间挑战。
English Summary: AMQ is an automated framework that assigns layer-wise quantization bit-widths to optimize the balance between model quality and memory usage for LLMs, overcoming the vast search space through innovations like search space pruning and quality prediction.

Authors:Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Title: TabStruct: Measuring Structural Fidelity of Tabular Data
Abstract:
Evaluating tabular generators remains a challenging problem, as the unique causal structural prior of heterogeneous tabular data does not lend itself to intuitive human inspection. Recent work has introduced structural fidelity as a tabular-specific evaluation dimension to assess whether synthetic data complies with the causal structures of real data. However, existing benchmarks often neglect the interplay between structural fidelity and conventional evaluation dimensions, thus failing to provide a holistic understanding of model performance. Moreover, they are typically limited to toy datasets, as quantifying existing structural fidelity metrics requires access to ground-truth causal structures, which are rarely available for real-world datasets. In this paper, we propose a novel evaluation framework that jointly considers structural fidelity and conventional evaluation dimensions. We introduce a new evaluation metric, $\textbf{global utility}$, which enables the assessment of structural fidelity even in the absence of ground-truth causal structures. In addition, we present $\textbf{TabStruct}$, a comprehensive evaluation benchmark offering large-scale quantitative analysis on 13 tabular generators from nine distinct categories, across 29 datasets. Our results demonstrate that global utility provides a task-independent, domain-agnostic lens for tabular generator performance. We release the TabStruct benchmark suite, including all datasets, evaluation pipelines, and raw results. Code is available at https://github.com/SilenceX12138/TabStruct.
中文: 本文提出了一种新颖的评估框架和名为全局效用的新指标,通过综合考虑结构保真度和传统维度来评估表格数据生成器,弥补了现有基准的不足,并借助TabStruct基准套件提供了全面分析。
English: This paper introduces a novel evaluation framework and a new metric called global utility to assess tabular generators by jointly considering structural fidelity and conventional dimensions, addressing limitations in existing benchmarks and providing a comprehensive analysis with the TabStruct benchmark suite.

Authors:Alexandre Sallinen, Stefan Krsteski, Paul Teiletche, Marc-Antoine Allard, Baptiste Lecoeur, Michael Zhang, Fabrice Nemo, David Kalajdzic, Matthias Meyer, Mary-Anne Hartley
Title: MMORE: Massive Multimodal Open RAG & Extraction
Abstract:
We introduce MMORE, an open-source pipeline for Massive Multimodal Open RetrievalAugmented Generation and Extraction, designed to ingest, transform, and retrieve knowledge from heterogeneous document formats at scale. MMORE supports more than fifteen file types, including text, tables, images, emails, audio, and video, and processes them into a unified format to enable downstream applications for LLMs. The architecture offers modular, distributed processing, enabling scalable parallelization across CPUs and GPUs. On processing benchmarks, MMORE demonstrates a 3.8-fold speedup over single-node baselines and 40% higher accuracy than Docling on scanned PDFs. The pipeline integrates hybrid dense-sparse retrieval and supports both interactive APIs and batch RAG endpoints. Evaluated on PubMedQA, MMORE-augmented medical LLMs improve biomedical QA accuracy with increasing retrieval depth. MMORE provides a robust, extensible foundation for deploying task-agnostic RAG systems on diverse, real-world multimodal data. The codebase is available at https://github.com/swiss-ai/mmore.
中文:MMORE是一个开源的多模态检索增强生成管道,能高效处理超过十五种文件类型并统一格式,在基准测试中展现出显著的速度和精度提升,同时增强了生物医学问答性能。
English: MMORE is an open-source pipeline for multimodal retrieval-augmented generation that efficiently processes over fifteen file types into a unified format, achieving significant speed and accuracy improvements in benchmarks and enhancing biomedical QA performance.

Authors:Marian Renz, Felix Igelbrink, Martin Atzmueller
Title: Integrating Prior Observations for Incremental 3D Scene Graph Prediction
Abstract:
3D semantic scene graphs (3DSSG) provide compact structured representations of environments by explicitly modeling objects, attributes, and relationships. While 3DSSGs have shown promise in robotics and embodied AI, many existing methods rely mainly on sensor data, not integrating further information from semantically rich environments. Additionally, most methods assume access to complete scene reconstructions, limiting their applicability in real-world, incremental settings. This paper introduces a novel heterogeneous graph model for incremental 3DSSG prediction that integrates additional, multi-modal information, such as prior observations, directly into the message-passing process. Utilizing multiple layers, the model flexibly incorporates global and local scene representations without requiring specialized modules or full scene reconstructions. We evaluate our approach on the 3DSSG dataset, showing that GNNs enriched with multi-modal information such as semantic embeddings (e.g., CLIP) and prior observations offer a scalable and generalizable solution for complex, real-world environments. The full source code of the presented architecture will be made available at https://github.com/m4renz/incremental-scene-graph-prediction.
中文: 本文提出了一种新颖的异构图模型,用于增量式3D语义场景图预测,该模型融合了先验观察和语义嵌入等多模态信息,无需完整场景重建即可提供可扩展的解决方案。
English: This paper presents a novel heterogeneous graph model for incremental 3D semantic scene graph prediction that integrates multi-modal information like prior observations and semantic embeddings, offering a scalable solution without requiring complete scene reconstructions.

Authors:Zhenni Yu, Li Zhao, Guobao Xiao, Xiaoqin Zhang
Title: SAM-TTT: Segment Anything Model via Reverse Parameter Configuration and Test-Time Training for Camouflaged Object Detection
Abstract:
This paper introduces a new Segment Anything Model (SAM) that leverages reverse parameter configuration and test-time training to enhance its performance on Camouflaged Object Detection (COD), named SAM-TTT. While most existing SAM-based COD models primarily focus on enhancing SAM by extracting favorable features and amplifying its advantageous parameters, a crucial gap is identified: insufficient attention to adverse parameters that impair SAM's semantic understanding in downstream tasks. To tackle this issue, the Reverse SAM Parameter Configuration Module is proposed to effectively mitigate the influence of adverse parameters in a train-free manner by configuring SAM's parameters. Building on this foundation, the T-Visioner Module is unveiled to strengthen advantageous parameters by integrating Test-Time Training layers, originally developed for language tasks, into vision tasks. Test-Time Training layers represent a new class of sequence modeling layers characterized by linear complexity and an expressive hidden state. By integrating two modules, SAM-TTT simultaneously suppresses adverse parameters while reinforcing advantageous ones, significantly improving SAM's semantic understanding in COD task. Our experimental results on various COD benchmarks demonstrate that the proposed approach achieves state-of-the-art performance, setting a new benchmark in the field. The code will be available at https://github.com/guobaoxiao/SAM-TTT.
中文: 本文提出SAM-TTT模型,通过逆向参数配置模块抑制干扰参数,并结合测试时训练模块增强有利参数,显著提升了伪装目标检测任务的语义理解能力,在多个基准测试中达到最优性能。
English: This paper presents SAM-TTT, a novel Segment Anything Model that enhances Camouflaged Object Detection by suppressing adverse parameters through reverse configuration while strengthening advantageous ones via test-time training layers, achieving state-of-the-art performance on multiple benchmarks.

Authors:Meng Luo, Shengqiong Wu, Liqiang Jing, Tianjie Ju, Li Zheng, Jinxiang Lai, Tianlong Wu, Xinya Du, Jian Li, Siyuan Yan, Jiebo Luo, William Yang Wang, Hao Fei, Mong-Li Lee, Wynne Hsu
Title: Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose Video Hallucination by Fine-grained Spatial-Temporal Grounding
Abstract:
Recent advancements in large video models (LVMs) have significantly enhance video understanding. However, these models continue to suffer from hallucinations, producing content that conflicts with input videos. To address this issue, we propose Dr.V, a hierarchical framework covering perceptive, temporal, and cognitive levels to diagnose video hallucination by fine-grained spatial-temporal grounding. Dr.V comprises of two key components: a benchmark dataset Dr.V-Bench and a satellite video agent Dr.V-Agent. Dr.V-Bench includes 10k instances drawn from 4,974 videos spanning diverse tasks, each enriched with detailed spatial-temporal annotation. Dr.V-Agent detects hallucinations in LVMs by systematically applying fine-grained spatial-temporal grounding at the perceptive and temporal levels, followed by cognitive level reasoning. This step-by-step pipeline mirrors human-like video comprehension and effectively identifies hallucinations. Extensive experiments demonstrate that Dr.V-Agent is effective in diagnosing hallucination while enhancing interpretability and reliability, offering a practical blueprint for robust video understanding in real-world scenarios. All our data and code are available at https://github.com/Eurekaleo/Dr.V.
中文:Dr.V框架通过细粒度时空定位的分层方法,结合全面的基准数据集和提升可解释性与可靠性的卫星代理,有效诊断大型视频模型中的幻觉问题。
English: The Dr.V framework effectively diagnoses hallucinations in large video models through a hierarchical approach using fine-grained spatial-temporal grounding, supported by a comprehensive benchmark dataset and a satellite agent that enhances interpretability and reliability.

Authors:Taichi Aida, Danushka Bollegala
Title: SCDTour: Embedding Axis Ordering and Merging for Interpretable Semantic Change Detection
Abstract:
In Semantic Change Detection (SCD), it is a common problem to obtain embeddings that are both interpretable and high-performing. However, improving interpretability often leads to a loss in the SCD performance, and vice versa. To address this problem, we propose SCDTour, a method that orders and merges interpretable axes to alleviate the performance degradation of SCD. SCDTour considers both (a) semantic similarity between axes in the embedding space, as well as (b) the degree to which each axis contributes to semantic change. Experimental results show that SCDTour preserves performance in semantic change detection while maintaining high interpretability. Moreover, agglomerating the sorted axes produces a more refined set of word senses, which achieves comparable or improved performance against the original full-dimensional embeddings in the SCD task. These findings demonstrate that SCDTour effectively balances interpretability and SCD performance, enabling meaningful interpretation of semantic shifts through a small number of refined axes. Source code is available at https://github.com/LivNLP/svp-tour .
中文:SCDTour方法通过排序和合并可解释轴,在保持语义变化检测性能的同时兼顾高可解释性,利用精炼的轴实现语义变化的有效解读,并获得可比或更优的结果。
English: SCDTour is a method that orders and merges interpretable axes to effectively balance semantic change detection performance with high interpretability, achieving comparable or improved results while enabling meaningful interpretation of semantic shifts through refined axes.

Authors:Liying Wang, Xiaoli Zhang, Chuanmin Jia, Siwei Ma
Title: MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation
Abstract:
Infrared-visible image fusion methods aim at generating fused images with good visual quality and also facilitate the performance of high-level tasks. Indeed, existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic task-level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation. MAFS is a parallel structure, containing a fusion sub-network and a segmentation sub-network. On the one hand, We devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights of two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/.
中文摘要:本文提出MAFS统一网络,通过并行子网络和多任务框架整合红外-可见光图像融合与语义分割,实现两项任务的相互促进。
English Summary: This paper introduces MAFS, a unified network that integrates infrared-visible image fusion and semantic segmentation through parallel sub-networks and a multi-task framework to mutually enhance both tasks.

Authors:Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
Title: SpecVLM: Fast Speculative Decoding in Vision-Language Models
Abstract:
Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens whose count scales with image resolution and video length, inflating both compute and memory, especially the key-value (KV) cache. We study speculative decoding for VLMs and introduce SpecVLM, a practical system that (1) establishes a strong EAGLE-2-style baseline, EagleVLM, delivering 1.5--2.3x end-to-end speedups over full autoregressive inference, and (2) further accelerates VLM inference with an elastic visual compressor that adaptively selects among pruning, pooling, convolution, and resampler primitives to balance FLOPs/parameters and accuracy per input. To avoid costly offline distillation corpora, we propose an online-logit distillation protocol that trains the draft model with on-the-fly teacher logits and penultimate features using a combined cross-entropy and Smooth L1 objective, eliminating storage and preprocessing while remaining compute-efficient. This protocol reveals a training-time scaling effect: longer online training monotonically increases the draft model's average accepted length, improving speculative efficiency. Empirically, SpecVLM achieves additional acceleration, culminating in 2.5--2.9x end-to-end speedups within 5 epochs across LLaVA and MMMU, consistently over resolutions and task difficulties, while preserving the target model's output distribution (lossless decoding). Our code is available at https://github.com/haiduo/SpecVLM.
Chinese: 推测解码通过引入SpecVLM系统加速视觉语言模型,该系统采用弹性视觉压缩器和在线对数蒸馏技术,在保持无损解码的同时实现2.5-2.9倍加速。
English: Speculative decoding accelerates vision-language models by introducing SpecVLM, which employs an elastic visual compressor and online-logit distillation to achieve 2.5–2.9x speedups while maintaining lossless decoding.

Authors:Mehwish Mehmood, Shahzaib Iqbal, Tariq Mahmood Khan, Ivor Spence, Muhammad Fahim
Title: LFRA-Net: A Lightweight Focal and Region-Aware Attention Network for Retinal Vessel Segmentatio
Abstract:
Retinal vessel segmentation is critical for the early diagnosis of vision-threatening and systemic diseases, especially in real-world clinical settings with limited computational resources. Although significant improvements have been made in deep learning-based segmentation methods, current models still face challenges in extracting tiny vessels and suffer from high computational costs. In this study, we present LFRA-Net by incorporating focal modulation attention at the encoder-decoder bottleneck and region-aware attention in the selective skip connections. LFRA-Net is a lightweight network optimized for precise and effective retinal vascular segmentation. It enhances feature representation and regional focus by efficiently capturing local and global dependencies. LFRA-Net outperformed many state-of-the-art models while maintaining lightweight characteristics with only 0.17 million parameters, 0.66 MB memory size, and 10.50 GFLOPs. We validated it on three publicly available datasets: DRIVE, STARE, and CHASE\_DB. It performed better in terms of Dice score (84.28\%, 88.44\%, and 85.50\%) and Jaccard index (72.86\%, 79.31\%, and 74.70\%) on the DRIVE, STARE, and CHASE\_DB datasets, respectively. LFRA-Net provides an ideal ratio between segmentation accuracy and computational cost compared to existing deep learning methods, which makes it suitable for real-time clinical applications in areas with limited resources. The code can be found at https://github.com/Mehwish4593/LFRA-Net.
中文: LFRA-Net是一种轻量级深度学习模型,通过融合焦点调制和区域感知注意力机制,在低计算资源下实现了高精度的视网膜血管分割,适用于临床实际应用。
English: LFRA-Net is a lightweight deep learning model that enhances retinal vessel segmentation by integrating focal modulation and region-aware attention, achieving high accuracy with minimal computational resources for real-world clinical use.

Authors:Eden Mama, Liel Sheri, Yehudit Aperstein, Alexander Apartsin
Title: From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives
Abstract:
The widespread adoption of large language models (LLMs) in healthcare raises critical questions about their ability to interpret patient-generated narratives, which are often informal, ambiguous, and noisy. Existing benchmarks typically rely on clean, structured clinical text, offering limited insight into model performance under realistic conditions. In this work, we present a novel synthetic dataset designed to simulate patient self-descriptions characterized by varying levels of linguistic noise, fuzzy language, and layperson terminology. Our dataset comprises clinically consistent scenarios annotated with ground-truth diagnoses, spanning a spectrum of communication clarity to reflect diverse real-world reporting styles. Using this benchmark, we fine-tune and evaluate several state-of-the-art models (LLMs), including BERT-based and encoder-decoder T5 models. To support reproducibility and future research, we release the Noisy Diagnostic Benchmark (NDB), a structured dataset of noisy, synthetic patient descriptions designed to stress-test and compare the diagnostic capabilities of large language models (LLMs) under realistic linguistic conditions. We made the benchmark available for the community: https://github.com/lielsheri/PatientSignal
大语言模型在解读非正式和嘈杂的患者叙述时面临挑战,因此本研究引入一个包含不同语言噪声的合成数据集,以评估其在真实条件下的诊断准确性。
Large language models face challenges in interpreting informal and noisy patient narratives, so this study introduces a synthetic dataset with varying linguistic noise to evaluate their diagnostic accuracy under realistic conditions.

Authors:Dvora Goncharok, Arbel Shifman, Alexander Apartsin, Yehudit Aperstein
Title: When Curiosity Signals Danger: Predicting Health Crises Through Online Medication Inquiries
Abstract:
Online medical forums are a rich and underutilized source of insight into patient concerns, especially regarding medication use. Some of the many questions users pose may signal confusion, misuse, or even the early warning signs of a developing health crisis. Detecting these critical questions that may precede severe adverse events or life-threatening complications is vital for timely intervention and improving patient safety. This study introduces a novel annotated dataset of medication-related questions extracted from online forums. Each entry is manually labelled for criticality based on clinical risk factors. We benchmark the performance of six traditional machine learning classifiers using TF-IDF textual representations, alongside three state-of-the-art large language model (LLM)-based classification approaches that leverage deep contextual understanding. Our results highlight the potential of classical and modern methods to support real-time triage and alert systems in digital health spaces. The curated dataset is made publicly available to encourage further research at the intersection of patient-generated data, natural language processing, and early warning systems for critical health events. The dataset and benchmark are available at: https://github.com/Dvora-coder/LLM-Medication-QA-Risk-Classifier-MediGuard.
中文摘要:本研究引入了一个新颖的在线论坛药物相关问题的标注数据集,通过评估传统机器学习与先进语言模型,为数字健康领域实现关键健康风险的实时分类预警。
English Summary: This study introduces a novel annotated dataset of medication-related questions from online forums to detect critical health concerns, evaluating both traditional machine learning and advanced language models for real-time risk classification in digital health.

Authors:Diogo Mendonça, Tiago Barros, Cristiano Premebida, Urbano J. Nunes
Title: Seg2Track-SAM2: SAM2-based Multi-object Tracking and Segmentation for Zero-shot Generalization
Abstract:
Autonomous systems require robust Multi-Object Tracking (MOT) capabilities to operate reliably in dynamic environments. MOT ensures consistent object identity assignment and precise spatial delineation. Recent advances in foundation models, such as SAM2, have demonstrated strong zero-shot generalization for video segmentation, but their direct application to MOTS (MOT+Segmentation) remains limited by insufficient identity management and memory efficiency. This work introduces Seg2Track-SAM2, a framework that integrates pre-trained object detectors with SAM2 and a novel Seg2Track module to address track initialization, track management, and reinforcement. The proposed approach requires no fine-tuning and remains detector-agnostic. Experimental results on KITTI MOT and KITTI MOTS benchmarks show that Seg2Track-SAM2 achieves state-of-the-art (SOTA) performance, ranking fourth overall in both car and pedestrian classes on KITTI MOTS, while establishing a new benchmark in association accuracy (AssA). Furthermore, a sliding-window memory strategy reduces memory usage by up to 75% with negligible performance degradation, supporting deployment under resource constraints. These results confirm that Seg2Track-SAM2 advances MOTS by combining robust zero-shot tracking, enhanced identity preservation, and efficient memory utilization. The code is available at https://github.com/hcmr-lab/Seg2Track-SAM2
中文摘要:Seg2Track-SAM2提出了一种创新框架,将目标检测器与SAM2及Seg2Track模块相结合,无需微调即可实现最先进的多目标跟踪与分割性能,并通过滑动窗口内存策略降低75%内存使用,同时保持优异的身份关联精度。
English Summary: Seg2Track-SAM2 introduces a novel framework combining object detectors with SAM2 and a Seg2Track module to achieve state-of-the-art MOTS performance without fine-tuning, featuring enhanced identity management and 75% memory reduction through sliding-window optimization.

Authors:Lauri Seppäläinen, Jakub Kubečka, Jonas Elm, Kai Puolamäki
Title: Fast and Interpretable Machine Learning Modelling of Atmospheric Molecular Clusters
Abstract:
Understanding how atmospheric molecular clusters form and grow is key to resolving one of the biggest uncertainties in climate modelling: the formation of new aerosol particles. While quantum chemistry offers accurate insights into these early-stage clusters, its steep computational costs limit large-scale exploration. In this work, we present a fast, interpretable, and surprisingly powerful alternative: $k$-nearest neighbour ($k$-NN) regression model. By leveraging chemically informed distance metrics, including a kernel-induced metric and one learned via metric learning for kernel regression (MLKR), we show that simple $k$-NN models can rival more complex kernel ridge regression (KRR) models in accuracy, while reducing computational time by orders of magnitude. We perform this comparison with the well-established Faber-Christensen-Huang-Lilienfeld (FCHL19) molecular descriptor, but other descriptors (e.g., FCHL18, MBDF, and CM) can be shown to have similar performance. Applied to both simple organic molecules in the QM9 benchmark set and large datasets of atmospheric molecular clusters (sulphuric acid-water and sulphuric-multibase -base systems), our $k$-NN models achieve near-chemical accuracy, scale seamlessly to datasets with over 250,000 entries, and even appears to extrapolate to larger unseen clusters with minimal error (often nearing 1 kcal/mol). With built-in interpretability and straightforward uncertainty estimation, this work positions $k$-NN as a potent tool for accelerating discovery in atmospheric chemistry and beyond.
中文: 本研究提出了一种快速且可解释的k近邻回归模型,在保持精度的同时大幅降低计算成本,为研究大气分子团簇形成和推进气候建模提供了有力工具。
English: This study introduces a fast and interpretable $k$-nearest neighbor regression model that rivals complex methods in accuracy while drastically reducing computational costs, offering a powerful tool for studying atmospheric molecular cluster formation and advancing climate modeling.

Authors:Wa-Kin Lei, Jun-Cheng Chen, Shang-Tse Chen
Title: DRAG: Data Reconstruction Attack using Guided Diffusion
Abstract:
With the rise of large foundation models, split inference (SI) has emerged as a popular computational paradigm for deploying models across lightweight edge devices and cloud servers, addressing data privacy and computational cost concerns. However, most existing data reconstruction attacks have focused on smaller CNN classification models, leaving the privacy risks of foundation models in SI settings largely unexplored. To address this gap, we propose a novel data reconstruction attack based on guided diffusion, which leverages the rich prior knowledge embedded in a latent diffusion model (LDM) pre-trained on a large-scale dataset. Our method performs iterative reconstruction on the LDM's learned image prior, effectively generating high-fidelity images resembling the original data from their intermediate representations (IR). Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, both qualitatively and quantitatively, in reconstructing data from deep-layer IRs of the vision foundation model. The results highlight the urgent need for more robust privacy protection mechanisms for large models in SI scenarios. Code is available at: https://github.com/ntuaislab/DRAG.
Chinese: 本研究提出了一种基于引导扩散的数据重建攻击方法,能够从分割推理场景中视觉基础模型的中间表示有效恢复高保真度图像,揭示了严重的隐私风险。
English: This study introduces a guided diffusion-based data reconstruction attack that effectively recovers high-fidelity images from intermediate representations of vision foundation models in split inference scenarios, revealing significant privacy vulnerabilities.

Authors:Yuqian Wu, Yuhong Peng, Jiapeng Yu, Xiangyu Liu, Zeting Yan, Kang Lin, Weifeng Su, Bingqing Qu, Raymond Lee, Dingqi Yang
Title: Beyond Regularity: Modeling Chaotic Mobility Patterns for Next Location Prediction
Abstract:
Next location prediction is a key task in human mobility analysis, crucial for applications like smart city resource allocation and personalized navigation services. However, existing methods face two significant challenges: first, they fail to address the dynamic imbalance between periodic and chaotic mobile patterns, leading to inadequate adaptation over sparse trajectories; second, they underutilize contextual cues, such as temporal regularities in arrival times, which persist even in chaotic patterns and offer stronger predictability than spatial forecasts due to reduced search spaces. To tackle these challenges, we propose \textbf{\method}, a \underline{\textbf{C}}h\underline{\textbf{A}}otic \underline{\textbf{N}}eural \underline{\textbf{O}}scillator n\underline{\textbf{E}}twork for next location prediction, which introduces a biologically inspired Chaotic Neural Oscillatory Attention mechanism to inject adaptive variability into traditional attention, enabling balanced representation of evolving mobility behaviors, and employs a Tri-Pair Interaction Encoder along with a Cross Context Attentive Decoder to fuse multimodal ``who-when-where'' contexts in a joint framework for enhanced prediction performance. Extensive experiments on two real-world datasets demonstrate that CANOE consistently and significantly outperforms a sizeable collection of state-of-the-art baselines, yielding 3.17\%-13.11\% improvement over the best-performing baselines across different cases. In particular, CANOE can make robust predictions over mobility trajectories of different mobility chaotic levels. A series of ablation studies also supports our key design choices. Our code is available at: https://github.com/yuqian2003/CANOE.
中文摘要:提出的CANOE模型通过动态平衡周期性与混沌移动模式并整合上下文线索,解决了下一位置预测中的关键难题,相比现有方法实现了显著性能提升。
English Summary: The proposed CANOE model addresses challenges in next location prediction by dynamically balancing periodic and chaotic mobility patterns and integrating contextual cues, achieving significant performance improvements over existing methods.

Authors:Chuang Liu, Nan Guo
Title: Joint-octamamba:an octa joint segmentation network based on feature enhanced mamba
Abstract:
OCTA is a crucial non-invasive imaging technique for diagnosing and monitoring retinal diseases like diabetic retinopathy, age-related macular degeneration, and glaucoma. Current 2D-based methods for retinal vessel (RV) segmentation offer insufficient accuracy. To address this, we propose RVMamba, a novel architecture integrating multiple feature extraction modules with the Mamba state-space model. Moreover, existing joint segmentation models for OCTA data exhibit performance imbalance between different tasks. To simultaneously improve the segmentation of the foveal avascular zone (FAZ) and mitigate this imbalance, we introduce FAZMamba and a unified Joint-OCTAMamba framework. Experimental results on the OCTA-500 dataset demonstrate that Joint-OCTAMamba outperforms existing models across evaluation metrics.The code is available at https://github.com/lc-sfis/Joint-OCTAMamba.
Chinese: 提出的RVMamba和Joint-OCTAMamba框架显著提升了OCTA成像中视网膜血管和黄斑无血管区的分割效果,在OCTA-500数据集上全面优于现有模型。
English: The proposed RVMamba and Joint-OCTAMamba frameworks significantly enhance retinal vessel and foveal avascular zone segmentation in OCTA imaging, outperforming existing models on the OCTA-500 dataset.

Authors:Qiyuan Guan, Qianfeng Yang, Xiang Chen, Tianyu Song, Guiyue Jin, Jiyu Jin
Title: WeatherBench: A Real-World Benchmark Dataset for All-in-One Adverse Weather Image Restoration
Abstract:
Existing all-in-one image restoration approaches, which aim to handle multiple weather degradations within a single framework, are predominantly trained and evaluated using mixed single-weather synthetic datasets. However, these datasets often differ significantly in resolution, style, and domain characteristics, leading to substantial domain gaps that hinder the development and fair evaluation of unified models. Furthermore, the lack of a large-scale, real-world all-in-one weather restoration dataset remains a critical bottleneck in advancing this field. To address these limitations, we present a real-world all-in-one adverse weather image restoration benchmark dataset, which contains image pairs captured under various weather conditions, including rain, snow, and haze, as well as diverse outdoor scenes and illumination settings. The resulting dataset provides precisely aligned degraded and clean images, enabling supervised learning and rigorous evaluation. We conduct comprehensive experiments by benchmarking a variety of task-specific, task-general, and all-in-one restoration methods on our dataset. Our dataset offers a valuable foundation for advancing robust and practical all-in-one image restoration in real-world scenarios. The dataset has been publicly released and is available at https://github.com/guanqiyuan/WeatherBench.
中文: 本文提出了一个真实世界的多天气图像恢复基准数据集,解决了现有合成数据集存在的领域差异问题,为统一模型的监督学习和严格评估提供了基础。
English: This paper introduces a real-world benchmark dataset for all-in-one adverse weather image restoration, addressing domain gaps in existing synthetic datasets and enabling supervised learning and rigorous evaluation of unified models.

Authors:Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, Linfeng Zhang
Title: SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
Abstract:
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel 'Forecast-then-verify' acceleration framework that effectively addresses both limitations. SpeCa's core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3x speedup on DiT while preserving generation fidelity, and 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our codes have been released in Github: \textbf{https://github.com/Shenyi-Z/Cache4Diffusion}
中文: SpeCa通过提出预测性采样框架,在扩散模型中预测后续时间步特征并高效验证可靠性,以最小计算开销实现最高7.3倍加速,同时保持生成质量。
English: SpeCa introduces a speculative sampling framework that accelerates diffusion models by predicting future timesteps and verifying their reliability with minimal overhead, achieving up to 7.3x speedup while maintaining generation quality.

Authors:Qi Zheng, Chaoran Zhang, Zijian Liang, EnTe Lin, Shubo Cui, Qinghongbing Xie, Zhaobo Xu, Long Zeng
Title: AssemMate: Graph-Based LLM for Robotic Assembly Assistance
Abstract:
Large Language Model (LLM)-based robotic assembly assistance has gained significant research attention. It requires the injection of domain-specific knowledge to guide the assembly process through natural language interaction with humans. Despite some progress, existing methods represent knowledge in the form of natural language text. Due to the long context and redundant content, they struggle to meet the robots' requirements for real-time and precise reasoning. In order to bridge this gap, we present AssemMate, which utilizes the graph\textemdash a concise and accurate form of knowledge representation\textemdash as input. This graph-based LLM enables knowledge graph question answering (KGQA), supporting human-robot interaction and assembly task planning for specific products. Beyond interactive QA, AssemMate also supports sensing stacked scenes and executing grasping to assist with assembly. Specifically, a self-supervised Graph Convolutional Network (GCN) encodes knowledge graph entities and relations into a latent space and aligns them with LLM's representation, enabling the LLM to understand graph information. In addition, a vision-enhanced strategy is employed to address stacked scenes in grasping. Through training and evaluation, AssemMate outperforms existing methods, achieving 6.4\% higher accuracy, 3 times faster inference, and 28 times shorter context length, while demonstrating strong generalization ability on random graphs. And our approach further demonstrates superiority through robotic grasping experiments in both simulated and real-world settings. More details can be found on the project page: https://github.com/cristina304/AssemMate.git
Chinese: AssemMate采用基于图的LLM方法,通过知识图谱提升机器人装配的实时推理与交互能力,实现了更高精度、更快推理速度和更短上下文长度,并在仿真和现实环境中支持视觉辅助抓取。
English: AssemMate introduces a graph-based LLM approach for robotic assembly, utilizing knowledge graphs to enhance real-time reasoning and interaction, achieving higher accuracy, faster inference, and shorter context lengths while supporting vision-aided grasping in both simulated and real environments.

Authors:Yanyun Pu, Kehan Li, Zeyi Huang, Zhijie Zhong, Kaixiang Yang
Title: MVQA-68K: A Multi-dimensional and Causally-annotated Dataset with Quality Interpretability for Video Assessment
Abstract:
With the rapid advancement of video generation models such as Sora, video quality assessment (VQA) is becoming increasingly crucial for selecting high-quality videos from large-scale datasets used in pre-training. Traditional VQA methods, typically producing single numerical scores, often lack comprehensiveness and interpretability. To address these challenges, we introduce MVQA-68K, a novel multi-dimensional VQA dataset comprising over 68,000 carefully annotated videos, covering seven essential quality dimensions: overall aesthetics, camera movement, dynamic degree, texture detail, composition, visual quality, and factual consistency. Each annotation includes detailed chain-of-thought reasoning to facilitate interpretability and comprehensive understanding. Extensive experiments demonstrate that MVQA-68K significantly enhances the performance of various multimodal large language models (MLLMs) on the VQA task, achieving state-of-the-art results not only on our internal test set (Fig.1) but also on public benchmarks including LSVQ-test, LSVQ-1080p, and LIVE-VQC. Meantime, incorporating explicit reasoning process during VQA training substantially boosts the zero-shot generalization. Code and dataset will be available at github: https://github.com/Controller01-ai/MVQA-68K
Chinese: 为解决传统视频质量评估方法的不足,我们推出了MVQA-68K多维数据集,包含超过68,000个带详细推理标注的视频,显著提升了多模态模型在视频质量评估任务上的性能与泛化能力。
English: To address the limitations of traditional video quality assessment methods, we introduce MVQA-68K, a multi-dimensional dataset with over 68,000 annotated videos and detailed reasoning, which significantly improves multimodal models' performance and generalization on VQA tasks.

Authors:Haonan Shi, Yubin Wang, De Cheng, Lingfeng He, Nannan Wang, Xinbo Gao
Title: Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification
Abstract:
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets by reducing the modality gap while minimizing reliance on costly manual annotations. Existing methods typically address USVI-ReID using cluster-based contrastive learning, which represents a person by a single cluster center. However, they primarily focus on the commonality of images within each cluster while neglecting the finer-grained differences among them. To address the limitation, we propose a Hierarchical Identity Learning (HIL) framework. Since each cluster may contain several smaller sub-clusters that reflect fine-grained variations among images, we generate multiple memories for each existing coarse-grained cluster via a secondary clustering. Additionally, we propose Multi-Center Contrastive Learning (MCCL) to refine representations for enhancing intra-modal clustering and minimizing cross-modal discrepancies. To further improve cross-modal matching quality, we design a Bidirectional Reverse Selection Transmission (BRST) mechanism, which establishes reliable cross-modal correspondences by performing bidirectional matching of pseudo-labels. Extensive experiments conducted on the SYSU-MM01 and RegDB datasets demonstrate that the proposed method outperforms existing approaches. The source code is available at: https://github.com/haonanshi0125/HIL.
中文摘要:该研究提出的分层身份学习框架通过多中心对比学习和双向匹配机制,解决了无监督可见光-红外行人重识别中细粒度差异被忽视的问题,在减少跨模态差异的同时显著提升了基准数据集上的性能表现。
English Summary: The proposed Hierarchical Identity Learning framework addresses limitations in unsupervised visible-infrared person re-identification by introducing multi-center contrastive learning and bidirectional matching to capture fine-grained variations while reducing cross-modal discrepancies, achieving superior performance on benchmark datasets.

Authors:Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen
Title: A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models
Abstract:
Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.
中文摘要:该综述将时间序列推理定义为将时间作为主要轴心,并按三种推理拓扑结构组织文献,评估其跨领域应用,同时强调需在计算成本与可靠、基于证据的结果之间取得平衡。
English Summary: This survey defines time series reasoning as treating time as a primary axis and organizes research into three reasoning topologies, evaluating their applications across domains while emphasizing the need to balance computational costs with reliable, evidence-based outcomes.

Authors:Sampoorna Poria, Xiaolei Huang
Title: Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia -- Current Stage and Challenges
Abstract:
Rapid developments of large language models have revolutionized many NLP tasks for English data. Unfortunately, the models and their evaluations for low-resource languages are being overlooked, especially for languages in South Asia. Although there are more than 650 languages in South Asia, many of them either have very limited computational resources or are missing from existing language models. Thus, a concrete question to be answered is: Can we assess the current stage and challenges to inform our NLP community and facilitate model developments for South Asian languages? In this survey, we have comprehensively examined current efforts and challenges of NLP models for South Asian languages by retrieving studies since 2020, with a focus on transformer-based models, such as BERT, T5, & GPT. We present advances and gaps across 3 essential aspects: data, models, & tasks, such as available data sources, fine-tuning strategies, & domain applications. Our findings highlight substantial issues, including missing data in critical domains (e.g., health), code-mixing, and lack of standardized evaluation benchmarks. Our survey aims to raise awareness within the NLP community for more targeted data curation, unify benchmarks tailored to cultural and linguistic nuances of South Asia, and encourage an equitable representation of South Asian languages. The complete list of resources is available at: https://github.com/trust-nlp/LM4SouthAsia-Survey.
中文: 大型语言模型的快速发展革新了英语自然语言处理任务,但南亚语言却面临严重忽视,资源匮乏且模型评估不足,本调查旨在揭示这些问题并倡导公平发展和标准化基准。
English: The rapid advancement of large language models has transformed English NLP tasks, yet South Asian languages face significant neglect, with limited resources and inadequate model evaluations, prompting this survey to highlight challenges and advocate for equitable development and standardized benchmarks.

Authors:Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang
Title: UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
Abstract:
Graphical User Interface (GUI) agents have demonstrated remarkable progress in automating complex user interface interactions through reinforcement learning. However, current approaches face a fundamental dilemma: offline RL enables stable training on pre-collected trajectories, but struggles with multi-step task execution for lack of trajectory-level reward signals; online RL captures these signals through environment interaction, but suffers from sparse rewards and prohibitive deployment costs. To address it, we present Semi-online Reinforcement Learning, a novel paradigm that simulates online RL on offline trajectories. During each rollout process, we preserve the original model output within the multi-turn dialogue, where a Patch Module adaptively recovers the divergence between rollout and expert trajectories. To capture long-term training signals, Semi-online RL introduces discounted future returns into the reward computation and optimizes the policy with weighted step-level and episode-level advantages. We further introduce Semi-Online Performance (SOP), a metric that aligns better with true online performance, serving as a practical and effective proxy for real-world evaluation. Experiments show that ours Semi-online RL achieves SOTA performance among 7B models across four dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging the gap between offline training efficiency and online multi-turn reasoning. The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.
Chinese: 半在线强化学习作为一种新范式,在离线轨迹上模拟在线强化学习,通过补丁模块自适应修正轨迹差异,并引入折扣未来回报来捕捉长期训练信号,有效弥合了离线训练效率与在线多步推理之间的差距,在多个动态基准测试中实现了最先进的性能。
English: Semi-online reinforcement learning is introduced as a novel paradigm that simulates online RL on offline trajectories, employing a Patch Module and incorporating discounted future returns to effectively bridge the gap between offline training efficiency and online multi-step task execution, achieving state-of-the-art performance across multiple benchmarks.

Authors:Dezhen Wang, Haixiang Zhao, Xiang Shen, Sheng Miao
Title: SFGNet: Semantic and Frequency Guided Network for Camouflaged Object Detection
Abstract:
Camouflaged object detection (COD) aims to segment objects that blend into their surroundings. However, most existing studies overlook the semantic differences among textual prompts of different targets as well as fine-grained frequency features. In this work, we propose a novel Semantic and Frequency Guided Network (SFGNet), which incorporates semantic prompts and frequency-domain features to capture camouflaged objects and improve boundary perception. We further design Multi-Band Fourier Module(MBFM) to enhance the ability of the network in handling complex backgrounds and blurred boundaries. In addition, we design an Interactive Structure Enhancement Block (ISEB) to ensure structural integrity and boundary details in the predictions. Extensive experiments conducted on three COD benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches. The core code of the model is available at the following link: https://github.com/winter794444/SFGNetICASSP2026.
中文摘要:本研究提出的SFGNet方法通过整合语义提示和频域特征,结合专门设计的模块,在伪装物体检测和边界感知方面显著优于现有方法。
English Summary: The proposed SFGNet method integrates semantic prompts and frequency-domain features through specialized modules to significantly improve camouflaged object detection and boundary perception, demonstrating superior performance over existing approaches.

Authors:Wenhao Tang, Sheng Huang, Heng Fang, Fengtao Zhou, Bo Liu, Qingshan Liu
Title: Multiple Instance Learning Framework with Masked Hard Instance Mining for Gigapixel Histopathology Image Analysis
Abstract:
Digitizing pathological images into gigapixel Whole Slide Images (WSIs) has opened new avenues for Computational Pathology (CPath). As positive tissue comprises only a small fraction of gigapixel WSIs, existing Multiple Instance Learning (MIL) methods typically focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting challenging ones. Recent studies have shown that hard examples are crucial for accurately modeling discriminative boundaries. Applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which utilizes a Siamese structure with a consistency constraint to explore the hard instances. Using a class-aware instance probability, MHIM-MIL employs a momentum teacher to mask salient instances and implicitly mine hard instances for training the student model. To obtain diverse, non-redundant hard instances, we adopt large-scale random masking while utilizing a global recycle network to mitigate the risk of losing key features. Furthermore, the student updates the teacher using an exponential moving average, which identifies new hard instances for subsequent training iterations and stabilizes optimization. Experimental results on cancer diagnosis, subtyping, survival analysis tasks, and 12 benchmarks demonstrate that MHIM-MIL outperforms the latest methods in both performance and efficiency. The code is available at: https://github.com/DearCaat/MHIM-MIL.
中文: MHIM-MIL框架通过孪生网络结构和掩码硬实例挖掘技术,有效解决了计算病理学中难以分类样本被忽略的问题,在多项基准测试中性能与效率均显著超越现有方法。
English: The MHIM-MIL framework introduces masked hard instance mining through a Siamese structure to address the neglect of challenging instances in computational pathology, significantly outperforming existing methods across multiple benchmarks while improving efficiency.

Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Title: Know What You Don't Know: Selective Prediction for Early Exit DNNs
Abstract:
Inference latency and trustworthiness of Deep Neural Networks (DNNs) are the bottlenecks in deploying them in critical applications like sensitive tasks. Early Exit (EE) DNNs overcome the latency issues by allowing samples to exit from intermediary layers if they attain `high' confidence scores on the predicted class. However, the DNNs are known to exhibit overconfidence, which can lead to many samples exiting early and render EE strategies untrustworthy. We use Selective Prediction (SP) to overcome this issue by checking the `hardness' of the samples rather than just relying on the confidence score alone. We propose SPEED, a novel approach that uses Deferral Classifiers (DCs) at each layer to check the hardness of samples before performing EEs. Specifically, the DCs identify if a sample is hard to predict at an intermediary layer, leading to hallucination, and defer it to an expert. Early detection of hard samples for inference prevents the wastage of computational resources and improves trust by deferring the hard samples to the expert. We demonstrate that EE aided with SP improves both accuracy and latency. Our method minimizes the risk of wrong prediction by $50\%$ with a speedup of $2.05\times$ as compared to the final layer. The anonymized source code is available at https://github.com/Div290/SPEED
中文: SPEED提出了一种新颖方法,通过在各层使用延迟分类器进行选择性预测,识别并推迟困难样本,将错误预测风险降低50%,在早期退出深度神经网络中实现2.05倍加速,同时提升准确性和延迟性能。
English: SPEED introduces a novel method using selective prediction with deferral classifiers at each layer to identify and defer hard samples, reducing wrong predictions by 50% and achieving a 2.05× speedup while improving both accuracy and latency in early exit deep neural networks.

Authors:Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca
Title: PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
Abstract:
BACKGROUND: Medical large language models (LLMS) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: to build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune a LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) datasets containing 8,380 questions spanning 12 medical domains (2018-2025). We selected eight medical LLMs including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions appropriately. We employed parameter-efficient fine tuning (PEFT)and low-rant adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set). RESULTS: medgemma-27b-text-it outperformed all other models, achieving a proportion of correct answers exceeding 90% in several instances. LLMs with <10 billion parameters exhibited <60% of correct answers, while some exams yielded results <50%. The fine-tuned version of medgemma-4b-it emerged victorious agains all LLMs with <10 billion parameters and rivaled a LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI application and research that require knowledge bases from Spanish-speaking countries and those exhibiting similar epidemiological profiles to Peru's, interested parties should utilize medgemma-27b-text-it or a fine-tuned version of medgemma-4b-it.
中文摘要:本研究评估了医学大语言模型在秘鲁西班牙语医学考试中的表现,发现medgemma-27b-text-it和微调后的medgemma-4b-it模型表现最优,特别适用于西班牙语国家及与秘鲁流行病学特征相似的地区。
English Summary: This study evaluates medical LLMs' performance on Spanish-language medical exams from Peru, finding that medgemma-27b-text-it and fine-tuned medgemma-4b-it deliver superior accuracy, making them optimal for Spanish-speaking regions with similar epidemiological profiles to Peru.

Authors:Fabrycio Leite Nakano Almada, Kauan Divino Pouso Mariano, Maykon Adriell Dutra, Victor Emanuel da Silva Monteiro, Juliana Resplande Sant'Anna Gomes, Arlindo Rodrigues Galvão Filho, Anderson da Silva Soares
Title: AKCIT-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for Multilingual Claim Normalization
Abstract:
Claim normalization, the transformation of informal social media posts into concise, self-contained statements, is a crucial step in automated fact-checking pipelines. This paper details our submission to the CLEF-2025 CheckThat! Task~2, which challenges systems to perform claim normalization across twenty languages, divided into thirteen supervised (high-resource) and seven zero-shot (no training data) tracks. Our approach, leveraging fine-tuned Small Language Models (SLMs) for supervised languages and Large Language Model (LLM) prompting for zero-shot scenarios, achieved podium positions (top three) in fifteen of the twenty languages. Notably, this included second-place rankings in eight languages, five of which were among the seven designated zero-shot languages, underscoring the effectiveness of our LLM-based zero-shot strategy. For Portuguese, our initial development language, our system achieved an average METEOR score of 0.5290, ranking third. All implementation artifacts, including inference, training, evaluation scripts, and prompt configurations, are publicly available at https://github.com/ju-resplande/checkthat2025_normalization.
中文摘要:本文提出了一种用于自动化事实核查的声明规范化系统,在CLEF-2025竞赛的20种语言任务中,有15种语言进入前三名,其中基于大语言模型的零样本方法在无训练数据的语言上表现尤为突出。
English Summary: This paper presents a claim normalization system for automated fact-checking that achieved top-three results in 15 out of 20 languages at CLEF-2025, demonstrating particular strength in zero-shot scenarios through effective LLM prompting strategies.

Authors:Ayhan Can Erdur, Christian Beischl, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C Peeken
Title: MultiMAE for Brain MRIs: Robustness to Missing Inputs Using Multi-Modal Masked Autoencoder
Abstract:
Missing input sequences are common in medical imaging data, posing a challenge for deep learning models reliant on complete input data. In this work, inspired by MultiMAE [2], we develop a masked autoencoder (MAE) paradigm for multi-modal, multi-task learning in 3D medical imaging with brain MRIs. Our method treats each MRI sequence as a separate input modality, leveraging a late-fusion-style transformer encoder to integrate multi-sequence information (multi-modal) and individual decoder streams for each modality for multi-task reconstruction. This pretraining strategy guides the model to learn rich representations per modality while also equipping it to handle missing inputs through cross-sequence reasoning. The result is a flexible and generalizable encoder for brain MRIs that infers missing sequences from available inputs and can be adapted to various downstream applications. We demonstrate the performance and robustness of our method against an MAE-ViT baseline in downstream segmentation and classification tasks, showing absolute improvement of $10.1$ overall Dice score and $0.46$ MCC over the baselines with missing input sequences. Our experiments demonstrate the strength of this pretraining strategy. The implementation is made available.
中文摘要:本研究提出了一种用于3D脑部MRI分析的掩码自编码器框架,通过将各MRI序列视为独立模态,利用跨序列推理重建缺失数据,在后续任务中实现了显著性能提升。
English Summary: This study introduces a masked autoencoder framework for 3D brain MRI analysis that handles missing sequences by treating each MRI sequence as a separate modality, using cross-sequence reasoning to reconstruct missing data while achieving significant performance gains in downstream tasks.

Authors:Jeanny Pan, Philipp Seeböck, Christoph Fürböck, Svitlana Pochepnia, Jennifer Straub, Lucian Beer, Helmut Prosch, Georg Langs
Title: Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery
Abstract:
Identifying new disease-related patterns in medical imaging data with the help of machine learning enlarges the vocabulary of recognizable findings. This supports diagnostic and prognostic assessment. However, image appearance varies not only due to biological differences, but also due to imaging technology linked to vendors, scanning- or re- construction parameters. The resulting domain shifts impedes data representation learning strategies and the discovery of biologically meaningful cluster appearances. To address these challenges, we introduce an approach to actively learn the domain shift via post-hoc rotation of the data latent space, enabling disentanglement of biological and technical factors. Results on real-world heterogeneous clinical data showcase that the learned disentangled representation leads to stable clusters representing tissue-types across different acquisition settings. Cluster consistency is improved by +19.01% (ARI), +16.85% (NMI), and +12.39% (Dice) compared to the entangled representation, outperforming four state-of-the-art harmonization methods. When using the clusters to quantify tissue composition on idiopathic pulmonary fibrosis patients, the learned profiles enhance Cox survival prediction. This indicates that the proposed label-free framework facilitates biomarker discovery in multi-center routine imaging data. Code is available on GitHub https://github.com/cirmuw/latent-space-rotation-disentanglement.
中文: 通过潜在空间旋转主动学习域偏移,该方法在医学影像中分离生物与技术因素,提升了跨设备聚类稳定性,并在多中心数据中增强了生存预测能力。
English: Machine learning identifies new disease patterns in medical imaging by disentangling biological and technical factors through latent space rotation, improving cluster consistency and enhancing survival prediction in multi-center data.

Authors:Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman
Title: FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Abstract:
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
中文摘要:FuseCodec通过跨模态对齐和全局监督融合了声学、语义和上下文语音表征,在转录准确性和语音质量方面实现了最先进的性能。
English Summary: FuseCodec unifies acoustic, semantic, and contextual speech representations through cross-modal alignment and global supervision, achieving state-of-the-art performance in transcription accuracy and speech quality.

Authors:Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman
Title: FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Abstract:
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.
中文摘要:FuseCodec通过跨模态对齐和全局监督融合了声学、语义和上下文语音表征,在转录准确性和语音质量方面实现了最先进的性能。
English Summary: FuseCodec unifies acoustic, semantic, and contextual speech representations through cross-modal alignment and global supervision, achieving state-of-the-art performance in transcription accuracy and speech quality.

Authors:Yijia Xiao, Edward Sun, Tong Chen, Fang Wu, Di Luo, Wei Wang
Title: Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning
Abstract:
Developing professional, structured reasoning on par with human financial analysts and traders remains a central challenge in AI for finance, where markets demand interpretability and trust. Traditional time-series models lack explainability, while LLMs face challenges in turning natural-language analysis into disciplined, executable trades. Although reasoning LLMs have advanced in step-by-step planning and verification, their application to risk-sensitive financial decisions is underexplored. We present Trading-R1, a financially-aware model that incorporates strategic thinking and planning for comprehensive thesis composition, facts-grounded analysis, and volatility-adjusted decision making. Trading-R1 aligns reasoning with trading principles through supervised fine-tuning and reinforcement learning with a three-stage easy-to-hard curriculum. Training uses Tauric-TR1-DB, a 100k-sample corpus spanning 18 months, 14 equities, and five heterogeneous financial data sources. Evaluated on six major equities and ETFs, Trading-R1 demonstrates improved risk-adjusted returns and lower drawdowns compared to both open-source and proprietary instruction-following models as well as reasoning models. The system generates structured, evidence-based investment theses that support disciplined and interpretable trading decisions. Trading-R1 Terminal will be released at https://github.com/TauricResearch/Trading-R1.
中文摘要:Trading-R1是一种具备金融意识的AI模型,通过结构化推理和基于证据的投资论述,提高了风险调整后收益并降低了回撤,满足了金融市场对可解释交易决策的需求。
English Summary: Trading-R1 is a financially-aware AI model that enhances risk-adjusted returns and reduces drawdowns through structured reasoning and evidence-based investment theses, addressing the need for interpretable trading decisions in financial markets.

Authors:Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long
Title: Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding
Abstract:
Motion estimation is a crucial component in multi-object tracking (MOT). It predicts the trajectory of objects by analyzing the changes in their positions in consecutive frames of images, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when KF's parameters are mismatched and objects move in non-stationary. In this work, we utilize the learning-aided filter to handle the motion estimation of MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) by two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves superior robustness and accuracy than existing learning-aided filters. The code is available at (https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker).
中文: 本文提出了一种名为语义独立卡尔曼网络(SIKNet)的学习辅助运动估计方法,通过两步编码状态向量来捕捉独立语义和非线性依赖信息,在多目标跟踪中展现出优于传统卡尔曼滤波器和其他学习型滤波器的鲁棒性与准确性。
English: This paper introduces Semantic-Independent KalmanNet (SIKNet), a learning-aided motion estimation method for multi-object tracking that enhances robustness and accuracy by encoding state vectors with independent semantic and nonlinear dependency information, outperforming traditional Kalman filters and other learning-based approaches.

Authors:Ziling Liu, Ziwei Chen, Mingqi Gao, Jinyu Yang, Feng Zheng
Title: Leveraging Geometric Priors for Unaligned Scene Change Detection
Abstract:
Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we introduce geometric priors for the first time to address the core challenges of unaligned SCD, for reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.
中文: 本文首次将几何先验引入非对齐场景变化检测,以解决视角变化和遮挡问题,提出了一种无需训练的框架,将该先验与视觉基础模型结合,在多个数据集上实现了优越且鲁棒的检测性能。
English: This paper introduces geometric priors into unaligned scene change detection to address viewpoint variations and occlusion challenges, proposing a training-free framework that integrates these priors with a visual foundation model for robust performance across multiple datasets.

Authors:Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Jun Gao, Congxuan Zhang, Xiaojuan Qi, Bing Li, Weiming Hu
Title: Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations
Abstract:
Large Vision-Language Models (LVLMs) suffer from serious hallucination problems, where the model-generated responses are inconsistent with the visual inputs. Existing hallucination mitigation methods are mainly based on preference alignment and require external human annotations or auxiliary models for preference data collection, which increase costs and limit sustainable improvement. To tackle these challenges, we propose Autonomous Preference Alignment via Self-Injection (APASI), a novel and generalizable method that mitigates hallucinations without external dependencies. APASI leverages the target LVLM to self-inject hallucinations into a generated response, creating a pair of responses with varying preference levels. During the self-injection process, the dis-preferred response is generated based on three key observations of hallucinations, ensuring it simulates real hallucination patterns. This fidelity offers an accurate learning signal for hallucination mitigation. Moreover, APASI incorporates an iterative alignment training strategy combined with curriculum learning to periodically update the preference data with increasing challenge, enabling stable and continuous enhancement of the LVLM. Extensive experiments across six benchmarks show that APASI not only effectively mitigates hallucinations for three baseline models but also achieves comparable or even superior performance to alignment-based methods with external dependency, thereby demonstrating its effectiveness and generalization capability. The code is available at https://github.com/davidluciolu/APASI.
中文: APASI是一种创新的自主偏好对齐方法,通过自我注入模拟幻觉并采用迭代训练,无需外部依赖即可有效减少大型视觉语言模型的幻觉问题,且性能媲美依赖外部资源的方法。
English: APASI is a novel autonomous preference alignment method that mitigates hallucinations in Large Vision-Language Models by self-injecting simulated hallucinations and using iterative training, achieving competitive performance without external dependencies.

Authors:Kerun Mi, Guoliang Kang, Guangyu Li, Lin Zhao, Tao Zhou, Chen Gong
Title: Cross-Domain Attribute Alignment with CLIP: A Rehearsal-Free Approach for Class-Incremental Unsupervised Domain Adaptation
Abstract:
Class-Incremental Unsupervised Domain Adaptation (CI-UDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where the sets of potential target classes appearing at different time steps are disjoint and are subsets of the source classes. The key to solving this problem lies in avoiding catastrophic forgetting of knowledge about previous target classes during continuously mitigating the domain shift. Most previous works cumbersomely combine two technical components. On one hand, they need to store and utilize rehearsal target sample from previous time steps to avoid catastrophic forgetting; on the other hand, they perform alignment only between classes shared across domains at each time step. Consequently, the memory will continuously increase and the asymmetric alignment may inevitably result in knowledge forgetting. In this paper, we propose to mine and preserve domain-invariant and class-agnostic knowledge to facilitate the CI-UDA task. Specifically, via using CLIP, we extract the class-agnostic properties which we name as "attribute". In our framework, we learn a "key-value" pair to represent an attribute, where the key corresponds to the visual prototype and the value is the textual prompt. We maintain two attribute dictionaries, each corresponding to a different domain. Then we perform attribute alignment across domains to mitigate the domain shift, via encouraging visual attention consistency and prediction consistency. Through attribute modeling and cross-domain alignment, we effectively reduce catastrophic knowledge forgetting while mitigating the domain shift, in a rehearsal-free way. Experiments on three CI-UDA benchmarks demonstrate that our method outperforms previous state-of-the-art methods and effectively alleviates catastrophic forgetting. Code is available at https://github.com/RyunMi/VisTA.
中文: 针对类别增量无监督领域自适应(CI-UDA)问题,本方法通过CLIP提取领域无关的"属性"特征,采用键值对表示并进行跨领域对齐,在无需记忆样本的情况下同步解决领域偏移和灾难性遗忘问题,实验证明其性能优于现有最优方法。
English: The proposed method for Class-Incremental Unsupervised Domain Adaptation (CI-UDA) leverages CLIP to extract domain-invariant attributes represented as key-value pairs, enabling rehearsal-free cross-domain alignment that effectively mitigates both domain shift and catastrophic forgetting while outperforming prior approaches.

Authors:Chengze li, Yitong Zhang, Jia Li, Liyi Cai, Ge Li
Title: Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation
Abstract:
LLMs have become the mainstream approaches to code generation. Existing LLMs mainly employ autoregressive generation, i.e. generating code token-by-token from left to right. However, the underlying autoregressive generation has two limitations in code generation. First, autoregressive LLMs only generate a token at each step, showing low efficiency in practice. Second, programming is a non-sequential process involving back-and-forth editing, while autoregressive LLMs only employ the left-to-right generation order. These two intrinsic limitations hinder the further development of LLMs in code generation. Recently, diffusion LLMs have emerged as a promising alternative. Diffusion LLMs address the above limitations with two advances, including multi-token prediction (i.e. generating multiple tokens at each step) and flexible generation order (i.e. flexibly determining which positions to generate tokens). However, there is no systematic study exploring diffusion LLMs in code generation. To bridge the knowledge gap, we present the first empirical study of diffusion LLMs for code generation. Our study involves 9 representative diffusion LLMs and conduct experiments on 4 widely used benchmarks. Based on the results, we summarize the following findings. (1) Existing diffusion LLMs are competitive with autoregressive LLMs with similar sizes. (2) Diffusion LLMs have a stronger length extrapolation ability than autoregressive LLMs and perform better in long code understanding. (3) We explore factors impacting the effectiveness and efficiency of diffusion LLMs, and provide practical guidance. (4) We discuss several promising further directions to improve diffusion LLMs on code generation. We open-source all source code, data, and results to facilitate the following research. The code is publicly available at https://github.com/zhangyitonggg/dllm4code.
中文摘要:自回归大语言模型在代码生成中存在效率低和顺序限制的问题,而扩散大语言模型通过多令牌预测和灵活生成顺序提供了有前景的替代方案,首个实证研究在四个基准测试中验证了九种模型的竞争优势。
English Summary: Autoregressive LLMs face efficiency and flexibility limitations in code generation, while diffusion LLMs offer promising alternatives through multi-token prediction and flexible generation order, as demonstrated by the first empirical study comparing nine models across four benchmarks.

Authors:Yitong Zhang, Ximo Li, Liyi Cai, Jia Li
Title: Realistic Environmental Injection Attacks on GUI Agents
Abstract:
GUI agents built on LVLMs are increasingly used to interact with websites. However, their exposure to open-world content makes them vulnerable to Environmental Injection Attacks (EIAs) that hijack agent behavior via webpage elements. Many recent studies assume the attacker to be a regular user who can only upload a single trigger image, which is more realistic than earlier assumptions of website-level administrative control. However, these works still fall short of realism: (1) the trigger's position and surrounding context remain largely fixed between training and testing, failing to capture the dynamic nature of real webpages and (2) the trigger often occupies an unrealistically large area, whereas real-world images are typically small. To better reflect real-world scenarios, we introduce a more realistic threat model where the attacker is a regular user and the trigger image is small and embedded within a dynamically changing environment. As a result, existing attacks prove largely ineffective under this threat model. To better expose the vulnerabilities of GUI agents, we propose Chameleon, an attack framework with two main novelties. The first is LLM-Driven Environment Simulation, which automatically generates diverse and high-fidelity webpage simulations. The second is Attention Black Hole, which transforms attention weights into explicit supervisory signals that guide the agent's focus toward the trigger region. We evaluate Chameleon on 6 realistic websites and 4 representative LVLM-powered GUI agents, where it significantly outperforms existing methods. Ablation studies confirm that both novelties are critical to performance. Our findings reveal underexplored vulnerabilities in modern GUI agents and establish a robust foundation for future research on defense in open-world GUI agent systems. The code is publicly available at https://github.com/zhangyitonggg/attack2gui.
中文: 基于LVLM的GUI代理易受环境注入攻击,本研究提出Chameleon攻击框架,通过模拟动态网页环境和引导代理关注小尺寸触发器,在现实威胁模型下显著优于现有方法。
English: GUI agents based on LVLMs are vulnerable to Environmental Injection Attacks, and this study introduces Chameleon, a novel attack framework that simulates dynamic web environments and directs agent attention to small triggers, significantly outperforming existing methods under realistic threat models.

Authors:Yechen Zhang, Bin Gao, Gang Wang, Jian Sun, Zhuo Li
Title: CORB-Planner: Corridor as Observations for RL Planning in High-Speed Flight
Abstract:
Reinforcement learning (RL) has shown promise in a large number of robotic control tasks. Nevertheless, its deployment on unmanned aerial vehicles (UAVs) remains challenging, mainly because of reliance on accurate dynamic models and platform-specific sensing, which hinders cross-platform transfer. This paper presents the CORB-Planner (Corridor-as-Observations for RL B-spline planner), a real-time, RL-based trajectory planning framework for high-speed autonomous UAV flight across heterogeneous platforms. The key idea is to combine B-spline trajectory generation with the RL policy producing successive control points with a compact safe flight corridor (SFC) representation obtained via heuristic search. The SFC abstracts obstacle information in a low-dimensional form, mitigating overfitting to platform-specific details and reducing sensitivity to model inaccuracies. To narrow the sim-to-real gap, we adopt an easy-to-hard progressive training pipeline in simulation. A value-based soft decomposed-critic Q (SDCQ) algorithm is used to learn effective policies within approximately ten minutes of training. Benchmarks in simulation and real-world tests demonstrate real-time planning on lightweight onboard hardware and support maximum flight speeds up to 8.2m/s in dense, cluttered environments without external positioning. Compatibility with various UAV configurations (quadrotors, hexarotors) and modest onboard compute underlines the generality and robustness of CORB-Planner for practical deployment.
中文:CORB-Planner是一种基于强化学习的框架,通过安全飞行走廊和B样条轨迹生成,实现了跨平台无人机的实时高速轨迹规划,在复杂环境中仅需少量训练即可获得鲁棒性能。
English: The CORB-Planner is a reinforcement learning-based framework that enables real-time, high-speed UAV trajectory planning across different platforms by using safe flight corridors and B-spline generation, achieving robust performance in cluttered environments with minimal training time.

Authors:Gao Yu Lee, Tanmoy Dam, Md Meftahul Ferdaus, Daniel Puiu Poenar, Vu N. Duong
Title: ANROT-HELANet: Adverserially and Naturally Robust Attention-Based Aggregation Network via The Hellinger Distance for Few-Shot Classification
Abstract:
Few-Shot Learning (FSL), which involves learning to generalize using only a few data samples, has demonstrated promising and superior performances to ordinary CNN methods. While Bayesian based estimation approaches using Kullback-Leibler (KL) divergence have shown improvements, they remain vulnerable to adversarial attacks and natural noises. We introduce ANROT-HELANet, an Adversarially and Naturally RObusT Hellinger Aggregation Network that significantly advances the state-of-the-art in FSL robustness and performance. Our approach implements an adversarially and naturally robust Hellinger distance-based feature class aggregation scheme, demonstrating resilience to adversarial perturbations up to $ε=0.30$ and Gaussian noise up to $σ=0.30$. The network achieves substantial improvements across benchmark datasets, including gains of 1.20\% and 1.40\% for 1-shot and 5-shot scenarios on miniImageNet respectively. We introduce a novel Hellinger Similarity contrastive loss function that generalizes cosine similarity contrastive loss for variational few-shot inference scenarios. Our approach also achieves superior image reconstruction quality with a FID score of 2.75, outperforming traditional VAE (3.43) and WAE (3.38) approaches. Extensive experiments conducted on four few-shot benchmarked datasets verify that ANROT-HELANet's combination of Hellinger distance-based feature aggregation, attention mechanisms, and our novel loss function establishes new state-of-the-art performance while maintaining robustness against both adversarial and natural perturbations. Our code repository will be available at https://github.com/GreedYLearner1146/ANROT-HELANet/tree/main.
中文: ANROT-HELANet提出了一种基于Hellinger距离的鲁棒聚合网络,显著提升了小样本学习的性能和对抗自然干扰与对抗性攻击的鲁棒性,在多个基准测试中取得了最先进的成果。
English: ANROT-HELANet introduces a robust Hellinger distance-based aggregation network that significantly enhances few-shot learning performance and resilience against adversarial and natural perturbations, achieving state-of-the-art results across multiple benchmarks.

Authors:Yihang She, Andrew Blake, David Coomes, Srinivasan Keshav
Title: Scaling Up Forest Vision with Synthetic Data
Abstract:
Accurate tree segmentation is a key step in extracting individual tree metrics from forest laser scans, and is essential to understanding ecosystem functions in carbon cycling and beyond. Over the past decade, tree segmentation algorithms have advanced rapidly due to developments in AI. However existing, public, 3D forest datasets are not large enough to build robust tree segmentation systems. Motivated by the success of synthetic data in other domains such as self-driving, we investigate whether similar approaches can help with tree segmentation. In place of expensive field data collection and annotation, we use synthetic data during pretraining, and then require only minimal, real forest plot annotation for fine-tuning. We have developed a new synthetic data generation pipeline to do this for forest vision tasks, integrating advances in game-engines with physics-based LiDAR simulation. As a result, we have produced a comprehensive, diverse, annotated 3D forest dataset on an unprecedented scale. Extensive experiments with a state-of-the-art tree segmentation algorithm and a popular real dataset show that our synthetic data can substantially reduce the need for labelled real data. After fine-tuning on just a single, real, forest plot of less than 0.1 hectare, the pretrained model achieves segmentations that are competitive with a model trained on the full scale real data. We have also identified critical factors for successful use of synthetic data: physics, diversity, and scale, paving the way for more robust 3D forest vision systems in the future. Our data generation pipeline and the resulting dataset are available at https://github.com/yihshe/CAMP3D.git.
Chinese: 本研究开发了一种合成数据生成流程,通过使用合成数据进行预训练并结合极少量的真实数据进行微调,显著降低了树木分割对标注真实数据的需求,仅需0.1公顷真实林地标注即可达到与全量真实数据训练相媲美的分割效果。
English: This study introduces a synthetic data generation pipeline that significantly reduces the need for labeled real data in tree segmentation by using synthetic data for pretraining and minimal real data for fine-tuning, achieving competitive results with only 0.1 hectare of real forest plot annotation.

Authors:Chengde Lin, Xuezhu Gong, Shuxue Ding, Mingzhe Yang, Xijun Lu, Chengjun Mo
Title: StegOT: Trade-offs in Steganography via Optimal Transport
Abstract:
Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on genera-tive adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.
中文: 本文提出StegOT模型,一种基于自编码器并结合最优传输理论的隐写方法,通过多通道最优传输模块平衡载体与秘密图像的信息,提升隐写和恢复图像的质量。
English: This paper introduces StegOT, an autoencoder-based steganography model that uses optimal transport theory to balance information between cover and secret images, improving both stego and recovery image quality.

Authors:Zhiwen Yang, Yuxin Peng
Title: SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion
Abstract:
Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems, assessing voxel-level geometry and semantics for holistic scene perception. While existing voxel-based and plane-based SSC methods have achieved considerable progress, they struggle to capture physical regularities for realistic geometric details. On the other hand, neural reconstruction methods like NeRF and 3DGS demonstrate superior physical awareness, but suffer from high computational cost and slow convergence when handling large-scale, complex autonomous driving scenes, leading to inferior semantic accuracy. To address these issues, we propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC, which integrates voxel and Gaussian representations for joint exploitation of semantic and physical information. First, the Semantic-guided Gaussian Initialization (SGI) module leverages dual-branch 3D scene representations to locate focal voxels as anchors to guide efficient Gaussian initialization. Then, the Physical-aware Harmonics Enhancement (PHE) module incorporates semantic spherical harmonics to model physical-aware contextual details and promote semantic-geometry consistency through focal distribution alignment, generating SSC results with realistic details. Extensive experiments and analyses on the popular SemanticKITTI and SSCBench-KITTI-360 benchmarks validate the effectiveness of SPHERE. The code is available at https://github.com/PKU-ICST-MIPL/SPHERE_ACMMM2025.
中文摘要:SPHERE框架通过融合体素与高斯表示,结合语义引导和物理感知建模,在自动驾驶场景中实现了更优的几何细节与语义精度的3D语义场景补全。
English Summary: The proposed SPHERE framework integrates voxel and Gaussian representations to enhance camera-based 3D Semantic Scene Completion by combining semantic guidance with physical-aware modeling, achieving superior geometric details and semantic accuracy in autonomous driving scenes.

Authors:Pouria Mahdavinia, Hamed Mahdavi, Niloofar Mireshghallah, Mehrdad Mahdavi
Title: Harnessing Optimization Dynamics for Curvature-Informed Model Merging
Abstract:
Model merging is an effective post-training strategy for composing capabilities in large language models without joint retraining. We study this in the supervised fine-tuning (SFT) stage, where multiple capability-based SFT checkpoints -- spanning math, code, precise instruction following, general instruction following, and knowledge recall -- must be consolidated into a single model. We introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware aggregation that leverages optimizer second-moment statistics as a diagonal curvature proxy to reweight parameter edits and mitigate interference. Complementing OTA, we propose Fast Fisher Grafting (FFG), a curvature-driven task-localization step that sparsifies conflicting or low-importance edits. FFG induces extremely low-rank masks concentrated in early attention query/key projections and token embeddings, exploiting shared curvature across capabilities. We further develop a memory-light compression of the second moments that preserves OTA's effect. Across diverse capability-based SFT checkpoints, OTA+FFG improves merged-model quality over strong weight-space baselines, reduces negative transfer, and remains robust across sparsity levels. Analyses reveal substantial curvature overlap between checkpoints, offering a novel lens on why simple linear merging can be effective in practice. Ablations confirm that FFG is critical for reducing task interference and that the compressed second moments retain the gains of the full formulation. To facilitate reproducibility, we open-source all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints at https://github.com/pmahdavi/ota-merge.
中文:OTA合并与快速费舍尔嫁接是一种创新方法,通过曲率感知参数聚合和任务定位技术,有效整合多个专业能力的语言模型,减少任务干扰并提升综合性能。
English: OTA merging with Fast Fisher Grafting is a novel method that effectively combines multiple capability-specific language models by using curvature-aware parameter aggregation and task-localization to reduce interference and enhance performance.

Authors:Youquan Xian, Xueying Zeng, Mei Huang, Aoxiang Zhou, Xiaoyu Cui, Peng Liu, Lei Cui
Title: UDFS: Lightweight Representation-Driven Robust Network Traffic Classification
Abstract:
In recent years, sequence features such as packet length have received considerable attention due to their central role in encrypted traffic analysis. Existing sequence modeling approaches can be broadly categorized into flow-level and trace-level methods: the former suffer from high feature redundancy, limiting their discriminative power, whereas the latter preserve complete information but incur substantial computational and storage overhead. To address these limitations, we propose the \textbf{U}p-\textbf{D}own \textbf{F}low \textbf{S}equence (\textbf{UDFS}) representation, which compresses an entire trace into a two-dimensional sequence and characterizes each flow by the aggregate of its upstream and downstream traffic, reducing complexity while maintaining high discriminability. Furthermore, to address the challenge of class-specific discriminability differences, we propose an adaptive threshold mechanism that dynamically adjusts training weights and rejection boundaries, enhancing the model's classification performance. Experimental results demonstrate that the proposed method achieves superior classification performance and robustness on both coarse-grained and fine-grained datasets, as well as under concept drift and open-world scenarios. Code and Dataset are available at https://github.com/kid1999/UDFS.
中文摘要:提出的UDFS表示法将流量轨迹压缩为二维序列以降低复杂度同时保持区分度,自适应阈值机制进一步提升了模型在多种场景下的分类性能。
English Summary: The proposed UDFS representation compresses traffic traces into two-dimensional sequences to reduce complexity while maintaining discriminability, and an adaptive threshold mechanism further enhances classification performance across various scenarios.

Authors:Mintae Kim, Jiaze Cai, Koushil Sreenath
Title: RoVerFly: Robust and Versatile Learning-based Control of Quadrotor Across Payload Configurations
Abstract:
Designing robust controllers for precise, arbitrary trajectory tracking with quadrotors is challenging due to nonlinear dynamics and underactuation, and becomes harder with flexible cable-suspended payloads that introduce extra degrees of freedom and hybridness. Classical model-based methods offer stability guarantees but require extensive tuning and often do not adapt when the configuration changes, such as when a payload is added or removed, or when the payload mass or cable length varies. We present RoVerFly, a unified learning-based control framework in which a reinforcement learning (RL) policy serves as a robust and versatile tracking controller for standard quadrotors and for cable-suspended payload systems across a range of configurations. Trained with task and domain randomization, the controller is resilient to disturbances and varying dynamics. It achieves strong zero-shot generalization across payload settings, including no payload as well as varying mass and cable length, without controller switching or re-tuning, while retaining the interpretability and structure of a feedback tracking controller. Code and supplementary materials are available at https://github.com/mintaeshkim/roverfly
Chinese: RoVerFly是一个基于学习的统一控制框架,通过单一强化学习策略作为隐式混合控制器,无需重新调整即可在各种负载条件下实现强大的零样本泛化能力。
English: RoVerFly is a unified learning-based control framework that uses a single reinforcement learning policy as an implicit hybrid controller, achieving robust zero-shot generalization across various payload conditions without requiring retuning.

Authors:Mintae Kim, Jiaze Cai, Koushil Sreenath
Title: RoVerFly: Robust and Versatile Implicit Hybrid Control of Quadrotor-Payload Systems
Abstract:
Designing robust controllers for precise trajectory tracking with quadrotors is challenging due to nonlinear dynamics and underactuation, and becomes harder with flexible cable-suspended payloads that add degrees of freedom and hybrid dynamics. Classical model-based methods offer stability guarantees but require extensive tuning and often fail to adapt when the configuration changes-when a payload is added or removed, or when its mass or cable length varies. We present RoVerFly, a unified learning-based control framework where a single reinforcement learning (RL) policy functions as an implicit hybrid controller, managing complex dynamics without explicit mode detection or controller switching. Trained with task and domain randomization, the controller is resilient to disturbances and varying dynamics. It achieves strong zero-shot generalization across payload settings-including no payload as well as varying mass and cable length-without re-tuning, while retaining the interpretability and structure of a feedback tracking controller. Code and supplementary materials are available at https://github.com/mintaeshkim/roverfly.
Chinese: RoVerFly是一个基于学习的统一控制框架,通过单一强化学习策略作为隐式混合控制器,无需重新调整即可在各种负载条件下实现强大的零样本泛化能力。
English: RoVerFly is a unified learning-based control framework that uses a single reinforcement learning policy as an implicit hybrid controller, achieving robust zero-shot generalization across various payload conditions without requiring retuning.

Authors:Jing Xiao, Chang You, Zhiyu Chen
Title: AlignKT: Explicitly Modeling Knowledge State for Knowledge Tracing with Ideal State Alignment
Abstract:
Knowledge Tracing (KT) serves as a fundamental component of Intelligent Tutoring Systems (ITS), enabling these systems to monitor and understand learners' progress by modeling their knowledge state. However, many existing KT models primarily focus on fitting the sequences of learners' interactions, and often overlook the knowledge state itself. This limitation leads to reduced interpretability and insufficient instructional support from the ITS. To address this challenge, we propose AlignKT, which employs a frontend-to-backend architecture to explicitly model a stable knowledge state. In this approach, the preliminary knowledge state is aligned with an additional criterion. Specifically, we define an ideal knowledge state based on pedagogical theories as the alignment criterion, providing a foundation for interpretability. We utilize five encoders to implement this set-up, and incorporate a contrastive learning module to enhance the robustness of the alignment process. Through extensive experiments, AlignKT demonstrates superior performance, outperforming seven KT baselines on three real-world datasets. It achieves state-of-the-art results on two of these datasets and exhibits competitive performance on the third. The code of this work is available at https://github.com/SCNU203/AlignKT.
中文摘要:AlignKT采用前后端架构,通过教学理论对齐和对比学习显式建模稳定知识状态,在多个数据集上实现最优性能,同时提升了知识追踪的可解释性。
English Summary: AlignKT introduces a frontend-to-backend architecture that explicitly models stable knowledge states using pedagogical alignment and contrastive learning, achieving state-of-the-art performance on multiple datasets while enhancing interpretability in knowledge tracing.

Authors:Zhi Chen, Le Zhang
Title: UltraUPConvNet: A UPerNet- and ConvNeXt-Based Multi-Task Network for Ultrasound Tissue Segmentation and Disease Prediction
Abstract:
Ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety. However, current AI research often treats disease prediction and tissue segmentation as two separate tasks and their model requires substantial computational overhead. In such a situation, we introduce UltraUPConvNet, a computationally efficient universal framework designed for both ultrasound image classification and segmentation. Trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions, our model achieves state-of-the-art performance on certain datasets with lower computational overhead. Our model weights and codes are available at https://github.com/yyxl123/UltraUPConvNet
中文: UltraUPConvNet是一种计算效率高的通用框架,在包含七个解剖区域超过9700个标注的大规模数据集上训练,能以较低计算开销在超声图像分类和分割任务中实现最先进的性能。
English: UltraUPConvNet is a computationally efficient universal framework that achieves state-of-the-art performance in both ultrasound image classification and segmentation with reduced computational overhead, trained on a large-scale dataset of over 9,700 annotations across seven anatomical regions.

Authors:Zhi Chen
Title: UltraUPConvNet: A UPerNet- and ConvNeXt-Based Multi-Task Network for Ultrasound Tissue Segmentation and Disease Prediction
Abstract:
Ultrasound imaging is widely used in clinical practice due to its cost-effectiveness, mobility, and safety. However, current AI research often treats disease prediction and tissue segmentation as two separate tasks and their model requires substantial computational overhead. In such a situation, we introduce UltraUPConvNet, a computationally efficient universal framework designed for both ultrasound image classification and segmentation. Trained on a large-scale dataset containing more than 9,700 annotations across seven different anatomical regions, our model achieves state-of-the-art performance on certain datasets with lower computational overhead. Our model weights and codes are available at https://github.com/yyxl123/UltraUPConvNet
中文: UltraUPConvNet是一种计算效率高的通用框架,在包含七个解剖区域超过9700个标注的大规模数据集上训练,能以较低计算开销在超声图像分类和分割任务中实现最先进的性能。
English: UltraUPConvNet is a computationally efficient universal framework that achieves state-of-the-art performance in both ultrasound image classification and segmentation with reduced computational overhead, trained on a large-scale dataset of over 9,700 annotations across seven anatomical regions.

Authors:Binhao Wang, Yutian Xiao, Maolin Wang, Zhiqi Li, Tianshuo Wei, Ruocheng Guo, Xiangyu Zhao
Title: SPARK: Adaptive Low-Rank Knowledge Graph Modeling in Hybrid Geometric Spaces for Recommendation
Abstract:
Knowledge Graphs (KGs) enhance recommender systems but face challenges from inherent noise, sparsity, and Euclidean geometry's inadequacy for complex relational structures, critically impairing representation learning, especially for long-tail entities. Existing methods also often lack adaptive multi-source signal fusion tailored to item popularity. This paper introduces SPARK, a novel multi-stage framework systematically tackling these issues. SPARK first employs Tucker low-rank decomposition to denoise KGs and generate robust entity representations. Subsequently, an SVD-initialized hybrid geometric GNN concurrently learns representations in Euclidean and Hyperbolic spaces; the latter is strategically leveraged for its aptitude in modeling hierarchical structures, effectively capturing semantic features of sparse, long-tail items. A core contribution is an item popularity-aware adaptive fusion strategy that dynamically weights signals from collaborative filtering, refined KG embeddings, and diverse geometric spaces for precise modeling of both mainstream and long-tail items. Finally, contrastive learning aligns these multi-source representations. Extensive experiments demonstrate SPARK's significant superiority over state-of-the-art methods, particularly in improving long-tail item recommendation, offering a robust, principled approach to knowledge-enhanced recommendation. Implementation code is available at https://github.com/Applied-Machine-Learning-Lab/SPARK.
中文摘要:SPARK是一个多阶段框架,通过去噪知识图谱、在欧几里得和双曲空间学习混合几何表示,并结合项目流行度的自适应多源信号融合,显著提升了推荐性能,尤其针对长尾项目。
English Summary: SPARK is a multi-stage framework that denoises knowledge graphs, learns hybrid geometric representations in Euclidean and Hyperbolic spaces, and adaptively fuses multi-source signals with item popularity awareness to significantly enhance recommendation performance, especially for long-tail items.

Authors:Chao Chen, Shunyu Yao, Yuanwu He, Tao Feng, Ruojing Song, Yuliang Guo, Xinyu Huang, Chenxu Wu, Ren Liu, Chen Feng
Title: End-to-End Visual Autonomous Parking via Control-Aided Attention
Abstract:
Precise parking requires an end-to-end system where perception adaptively provides policy-relevant details-especially in critical areas where fine control decisions are essential. End-to-end learning offers a unified framework by directly mapping sensor inputs to control actions, but existing approaches lack effective synergy between perception and control. We find that transformer-based self-attention, when used alone, tends to produce unstable and temporally inconsistent spatial attention, which undermines the reliability of downstream policy decisions over time. Instead, we propose CAA-Policy, an end-to-end imitation learning system that allows control signal to guide the learning of visual attention via a novel Control-Aided Attention (CAA) mechanism. For the first time, we train such an attention module in a self-supervised manner, using backpropagated gradients from the control outputs instead of from the training loss. This strategy encourages the attention to focus on visual features that induce high variance in action outputs, rather than merely minimizing the training loss-a shift we demonstrate leads to a more robust and generalizable policy. To further enhance stability, CAA-Policy integrates short-horizon waypoint prediction as an auxiliary task, and introduces a separately trained motion prediction module to robustly track the target spot over time. Extensive experiments in the CARLA simulator show that \titlevariable~consistently surpasses both the end-to-end learning baseline and the modular BEV segmentation + hybrid A* pipeline, achieving superior accuracy, robustness, and interpretability. Code is released at https://github.com/Joechencc/CAAPolicy.
Chinese: 提出的CAA-Policy通过控制辅助注意力机制,以自监督方式利用控制信号引导视觉注意力,显著提升了停车策略的鲁棒性,并在精度和稳定性上超越了现有方法。
English: The proposed CAA-Policy introduces a Control-Aided Attention mechanism that uses control signals to guide visual attention in a self-supervised manner, enhancing parking policy robustness and outperforming existing methods in accuracy and stability.

Authors:Xiaoyu Huang, Lauren M Maxson, Trang Nguyen, Cheng Jack Song, Yuankai Huo
Title: Organoid Tracker: A SAM2-Powered Platform for Zero-shot Cyst Analysis in Human Kidney Organoid Videos
Abstract:
Recent advances in organoid models have revolutionized the study of human kidney disease mechanisms and drug discovery by enabling scalable, cost-effective research without the need for animal sacrifice. Here, we present a kidney organoid platform optimized for efficient screening in polycystic kidney disease (PKD). While these systems generate rich spatial-temporal microscopy video datasets, current manual approaches to analysis remain limited to coarse classifications (e.g., hit vs. non-hit), often missing valuable pixel-level and longitudinal information. To help overcome this bottleneck, we developed Organoid Tracker, a graphical user interface (GUI) platform designed with a modular plugin architecture, which empowers researchers to extract detailed, quantitative metrics without programming expertise. Built on the cutting-edge vision foundation model Segment Anything Model 2 (SAM2), Organoid Tracker enables zero-shot segmentation and automated analysis of spatial-temporal microscopy videos. It quantifies key metrics such as cyst formation rate, growth velocity, and morphological changes, while generating comprehensive reports. By providing an extensible, open-source framework, Organoid Tracker offers a powerful solution for improving and accelerating research in kidney development, PKD modeling, and therapeutic discovery. The platform is publicly available as open-source software at https://github.com/hrlblab/OrganoidTracker.
Chinese: Organoid Tracker平台基于Segment Anything Model 2构建,能够对肾脏类器官显微镜视频进行自动化零样本分割和定量分析,为多囊肾病研究与药物发现提供高效工具。
English: The Organoid Tracker platform, built on the Segment Anything Model 2, enables automated, zero-shot segmentation and quantitative analysis of kidney organoid microscopy videos to accelerate research in polycystic kidney disease and drug discovery.

Authors:Paul Irofti, Luis Romero-Ben, Florin Stoican, Vicenç Puig
Title: Factor Graph Optimization for Leak Localization in Water Distribution Networks
Abstract:
Detecting and localizing leaks in water distribution network systems is an important topic with direct environmental, economic, and social impact. Our paper is the first to explore the use of factor graph optimization techniques for leak localization in water distribution networks, enabling us to perform sensor fusion between pressure and demand sensor readings and to estimate the network's temporal and structural state evolution across all network nodes. The methodology introduces specific water network factors and proposes a new architecture composed of two factor graphs: a leak-free state estimation factor graph and a leak localization factor graph. When a new sensor reading is obtained, unlike Kalman and other interpolation-based methods, which estimate only the current network state, factor graphs update both current and past states. Results on Modena, L-TOWN and synthetic networks show that factor graphs are much faster than nonlinear Kalman-based alternatives such as the UKF, while also providing improvements in localization compared to state-of-the-art estimation-localization approaches. Implementation and benchmarks are available at https://github.com/pirofti/FGLL.
中文摘要:本文首次将因子图优化技术应用于水管网络泄漏定位,通过融合压力与流量传感器数据实现全网状态估计,相比传统方法在定位精度和计算速度上均有显著提升。
English Summary: This paper pioneers the use of factor graph optimization for leak detection in water networks, enabling sensor fusion and state estimation across all nodes while outperforming traditional methods in speed and localization accuracy.

Authors:Lihi Nofar, Tomer Portal, Aviv Elbaz, Alexander Apartsin, Yehudit Aperstein
Title: An Interpretable Benchmark for Clickbait Detection and Tactic Attribution
Abstract:
The proliferation of clickbait headlines poses significant challenges to the credibility of information and user trust in digital media. While recent advances in machine learning have improved the detection of manipulative content, the lack of explainability limits their practical adoption. This paper presents a model for explainable clickbait detection that not only identifies clickbait titles but also attributes them to specific linguistic manipulation strategies. We introduce a synthetic dataset generated by systematically augmenting real news headlines using a predefined catalogue of clickbait strategies. This dataset enables controlled experimentation and detailed analysis of model behaviour. We present a two-stage framework for automatic clickbait analysis comprising detection and tactic attribution. In the first stage, we compare a fine-tuned BERT classifier with large language models (LLMs), specifically GPT-4.0 and Gemini 2.4 Flash, under both zero-shot prompting and few-shot prompting enriched with illustrative clickbait headlines and their associated persuasive tactics. In the second stage, a dedicated BERT-based classifier predicts the specific clickbait strategies present in each headline. This work advances the development of transparent and trustworthy AI systems for combating manipulative media content. We share the dataset with the research community at https://github.com/LLM-HITCS25S/ClickbaitTacticsDetection
中文摘要:本文提出一种可解释的点击诱饵检测模型,通过合成数据集和两阶段框架结合BERT与大语言模型,不仅能识别误导性标题,还能归因其具体语言操纵策略,推动透明AI系统的发展。
English Summary: This paper introduces an explainable clickbait detection model that identifies manipulative headlines and attributes them to specific linguistic strategies, using a synthetic dataset and a two-stage framework combining BERT and LLMs for transparent AI analysis.

Authors:Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, Shengjie Wang
Title: ToMA: Token Merge with Attention for Diffusion Models
Abstract:
Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merge as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23%, respectively (with DINO $Δ< 0.07$), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion.
中文摘要:ToMA是一种GPU高效令牌缩减方法,通过将令牌合并重构为子模优化问题和线性变换,在保持图像质量的同时将SDXL/Flux生成延迟降低24%/23%。
English Summary: ToMA is a GPU-efficient token reduction method that redesigns token merging as a submodular optimization problem and linear transformation, cutting SDXL/Flux latency by 24%/23% while maintaining image quality.

Authors:Ali Hedayatnia, Mostafa Tavassolipour, Babak Nadjar Araabi, Abdol-Hossein Vahabie
Title: Robustifying Diffusion-Denoised Smoothing Against Covariate Shift
Abstract:
Randomized smoothing is a well-established method for achieving certified robustness against l2-adversarial perturbations. By incorporating a denoiser before the base classifier, pretrained classifiers can be seamlessly integrated into randomized smoothing without significant performance degradation. Among existing methods, Diffusion Denoised Smoothing - where a pretrained denoising diffusion model serves as the denoiser - has produced state-of-the-art results. However, we show that employing a denoising diffusion model introduces a covariate shift via misestimation of the added noise, ultimately degrading the smoothed classifier's performance. To address this issue, we propose a novel adversarial objective function focused on the added noise of the denoising diffusion model. This approach is inspired by our understanding of the origin of the covariate shift. Our goal is to train the base classifier to ensure it is robust against the covariate shift introduced by the denoiser. Our method significantly improves certified accuracy across three standard classification benchmarks - MNIST, CIFAR-10, and ImageNet - achieving new state-of-the-art performance in l2-adversarial perturbations. Our implementation is publicly available at https://github.com/ahedayat/Robustifying-DDS-Against-Covariate-Shift
中文摘要:随机平滑是一种增强分类器对抗性鲁棒性的方法,本研究通过提出针对去噪扩散模型噪声估计偏差的新型对抗目标,有效解决了协变量偏移问题,在多个数据集上实现了最优认证精度。
English Summary: Randomized smoothing enhances classifier robustness against adversarial attacks, and this study introduces a novel adversarial objective to correct covariate shift in denoising diffusion models, achieving state-of-the-art certified accuracy on multiple benchmarks.

Authors:Weiqiang Zhao, Tianzhu Liu, Yuzhe Gui, Yanfeng Gu
Title: Total Variation Subgradient Guided Image Fusion for Dual-Camera CASSI System
Abstract:
Spectral imaging technology has long-faced fundamental challenges in balancing spectral, spatial, and temporal resolutions. While compressive sensing-based Coded Aperture Snapshot Spectral Imaging (CASSI) mitigates this trade-off through optical encoding, high compression ratios result in ill-posed reconstruction problems. Traditional model-based methods exhibit limited performance due to reliance on handcrafted inherent image priors, while deep learning approaches are constrained by their black-box nature, which compromises physical interpretability. To address these limitations, we propose a dual-camera CASSI reconstruction framework that integrates total variation (TV) subgradient theory. By establishing an end-to-end SD-CASSI mathematical model, we reduce the computational complexity of solving the inverse problem and provide a mathematically well-founded framework for analyzing multi-camera systems. A dynamic regularization strategy is introduced, incorporating normalized gradient constraints from RGB/panchromatic-derived reference images, which constructs a TV subgradient similarity function with strict convex optimization guarantees. Leveraging spatial priors from auxiliary cameras, an adaptive reference generation and updating mechanism is designed to provide subgradient guidance. Experimental results demonstrate that the proposed method effectively preserves spatial-spectral structural consistency. The theoretical framework establishes an interpretable mathematical foundation for computational spectral imaging, demonstrating robust performance across diverse reconstruction scenarios. The source code is available at https://github.com/bestwishes43/ADMM-TVDS.
中文摘要:该研究提出的双相机CASSI框架通过结合全变分次梯度理论与动态正则化策略,在保持光谱空间结构一致性的同时,为计算光谱成像建立了可解释的数学基础,有效解决了高压缩比下的病态重建问题。
English Summary: The proposed dual-camera CASSI framework integrates total variation subgradient theory with dynamic regularization to overcome reconstruction limitations, achieving enhanced spatial-spectral consistency while providing mathematical interpretability for computational spectral imaging.

Authors:Aryan Kashyap Naveen, Bhuvanesh Singla, Raajan Wankhade, Shreesha M, Ramu S, Ram Mohana Reddy Guddeti
Title: AutoOEP -- A Multi-modal Framework for Online Exam Proctoring
Abstract:
The burgeoning of online education has created an urgent need for robust and scalable systems to ensure academic integrity during remote examinations. Traditional human proctoring is often not feasible at scale, while existing automated solutions can be intrusive or fail to detect a wide range of cheating behaviors. This paper introduces AutoOEP (Automated Online Exam Proctoring), a comprehensive, multi-modal framework that leverages computer vision and machine learning to provide effective, automated proctoring. The system utilizes a dual-camera setup to capture both a frontal view of the examinee and a side view of the workspace, minimizing blind spots. Our approach integrates several parallel analyses: the Face Module performs continuous identity verification using ArcFace, along with head pose estimation, gaze tracking, and mouth movement analysis to detect suspicious cues. Concurrently, the Hand Module employs a fine-tuned YOLOv11 model for detecting prohibited items (e.g., mobile phones, notes) and tracks hand proximity to these objects. Features from these modules are aggregated and fed into a Long Short-Term Memory (LSTM) network that analyzes temporal patterns to calculate a real-time cheating probability score. We evaluate AutoOEP on a custom-collected dataset simulating diverse exam conditions. Our system achieves an accuracy of 90.7% in classifying suspicious activities. The object detection component obtains a mean Average Precision (mAP@.5) of 0.57 for prohibited items, and the entire framework processes video streams at approximately 2.4 frames per second without a GPU. The results demonstrate that AutoOEP is an effective and resource-efficient solution for automated proctoring, significantly reducing the need for human intervention and enhancing the integrity of online assessments. The code is public and can be accessed at https://github.com/05kashyap/AutoOEP.
中文摘要:本文提出AutoOEP系统,通过计算机视觉和机器学习技术,结合双摄像头进行面部行为分析和违禁物品检测,实现了90.7%的在线考试作弊行为识别准确率,为远程监考提供了高效解决方案。
English Summary: This paper introduces AutoOEP, a multi-modal automated proctoring system using computer vision and machine learning to detect cheating behaviors through facial analysis and object detection, achieving 90.7% accuracy in identifying suspicious activities during online exams.

Authors:Xinyu Zhang, Pei Zhang, Shuang Luo, Jialong Tang, Yu Wan, Baosong Yang, Fei Huang
Title: CultureSynth: A Hierarchical Taxonomy-Guided and Retrieval-Augmented Framework for Cultural Question-Answer Synthesis
Abstract:
Cultural competence, defined as the ability to understand and adapt to multicultural contexts, is increasingly vital for large language models (LLMs) in global environments. While several cultural benchmarks exist to assess LLMs' cultural competence, current evaluations suffer from fragmented taxonomies, domain specificity, and heavy reliance on manual data annotation. To address these limitations, we introduce CultureSynth, a novel framework comprising (1) a comprehensive hierarchical multilingual cultural taxonomy covering 12 primary and 130 secondary topics, and (2) a Retrieval-Augmented Generation (RAG)-based methodology leveraging factual knowledge to synthesize culturally relevant question-answer pairs. The CultureSynth-7 synthetic benchmark contains 19,360 entries and 4,149 manually verified entries across 7 languages. Evaluation of 14 prevalent LLMs of different sizes reveals clear performance stratification led by ChatGPT-4o-Latest and Qwen2.5-72B-Instruct. The results demonstrate that a 3B-parameter threshold is necessary for achieving basic cultural competence, models display varying architectural biases in knowledge processing, and significant geographic disparities exist across models. We believe that CultureSynth offers a scalable framework for developing culturally aware AI systems while reducing reliance on manual annotation\footnote{Benchmark is available at https://github.com/Eyr3/CultureSynth.}.
中文: 本文提出CultureSynth框架,通过构建多语言文化分类体系和基于检索增强生成的问答合成方法,解决了当前大模型文化能力评估的局限性,并在14个模型的测试中揭示了性能分层和地域差异现象。
English: This paper introduces CultureSynth, a scalable framework with a multilingual cultural taxonomy and RAG-based methodology to synthesize culturally relevant QA pairs, addressing limitations in current LLM cultural competence evaluations and revealing performance stratification and geographic disparities across 14 tested models.

Authors:Qingxiang Liu, Ting Huang, Zeyu Zhang, Hao Tang
Title: Nav-R1: Reasoning and Navigation in Embodied Scenes
Abstract:
Embodied navigation requires agents to integrate perception, reasoning, and action for robust interaction in complex 3D environments. Existing approaches often suffer from incoherent and unstable reasoning traces that hinder generalization across diverse environments, and difficulty balancing long-horizon semantic reasoning with low-latency control for real-time navigation. To address these challenges, we propose Nav-R1, an embodied foundation model that unifies reasoning in embodied environments. We first construct Nav-CoT-110K, a large-scale dataset of step-by-step Chains-of-Thought (CoT) for embodied tasks, which enables cold-start initialization with structured reasoning. Building on this foundation, we design a GRPO-based reinforcement learning framework with three complementary rewards: format, understanding, and navigation, to improve structural adherence, semantic grounding, and path fidelity. Furthermore, we introduce a Fast-in-Slow reasoning paradigm, decoupling deliberate semantic reasoning from low-latency reactive control for efficient yet coherent navigation. Extensive evaluations on embodied AI benchmarks demonstrate that Nav-R1 consistently outperforms strong baselines, with over 8% average improvement in reasoning and navigation performance. Real-world deployment on a mobile robot further validates its robustness under limited onboard resources. Code: https://github.com/AIGeeksGroup/Nav-R1. Website: https://aigeeksgroup.github.io/Nav-R1.
中文:Nav-R1模型通过构建大规模思维链数据集和强化学习框架,解决了具身导航中的推理不连贯问题,在仿真测试和真实机器人部署中均实现了显著的性能提升。
English: The Nav-R1 model addresses challenges in embodied navigation by unifying reasoning through a large-scale dataset and a reinforcement learning framework, achieving significant performance improvements in both simulated benchmarks and real-world robot deployment.

Authors:Jing Xiao, Hongfei Liu, Ruiqi Dong, Jimin Liu, Haoyong Yu
Title: Automated Radiology Report Generation Based on Topic-Keyword Semantic Guidance
Abstract:
Automated radiology report generation is essential in clinical practice. However, diagnosing radiological images typically requires physicians 5-10 minutes, resulting in a waste of valuable healthcare resources. Existing studies have not fully leveraged knowledge from historical radiology reports, lacking sufficient and accurate prior information. To address this, we propose a Topic-Keyword Semantic Guidance (TKSG) framework. This framework uses BiomedCLIP to accurately retrieve historical similar cases. Supported by multimodal, TKSG accurately detects topic words (disease classifications) and keywords (common symptoms) in diagnoses. The probabilities of topic terms are aggregated into a topic vector, serving as global information to guide the entire decoding process. Additionally, a semantic-guided attention module is designed to refine local decoding with keyword content, ensuring report accuracy and relevance. Experimental results show that our model achieves excellent performance on both IU X-Ray and MIMIC-CXR datasets. The code is available at https://github.com/SCNU203/TKSG.
中文: 提出的主题-关键词语义引导(TKSG)框架通过BiomedCLIP检索和多模态分析利用历史病例知识,显著提升了自动化放射学报告生成的准确性和相关性,在基准数据集上表现优异。
English: The proposed Topic-Keyword Semantic Guidance (TKSG) framework enhances automated radiology report generation by leveraging historical case knowledge through BiomedCLIP retrieval and multimodal analysis, achieving superior performance on benchmark datasets.

Authors:Tien-En Chang, Argon Chen
Title: Variable Selection Using Relative Importance Rankings
Abstract:
Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable ranking and filter-based selection before model creation. Specifically, we anticipate strong performance from the RI measures because they incorporate both direct and combined effects of predictors, addressing a key limitation of marginal correlation that ignores dependencies among predictors. We implement and evaluate the RI-based variable selection methods using general dominance (GD), comprehensive relative importance (CRI), and a newly proposed, computationally efficient variant termed CRI.Z. We first demonstrate how the RI measures more accurately rank the variables than the marginal correlation, especially when there are suppressed or weak predictors. We then show that predictive models built on these rankings are highly competitive, often outperforming state-of-the-art methods such as the lasso and relaxed lasso. The proposed RI-based methods are particularly effective in challenging cases involving clusters of highly correlated predictors, a setting known to cause failures in many benchmark methods. Although lasso methods have dominated the recent literature on variable selection, our study reveals that the RI-based method is a powerful and competitive alternative. We believe these underutilized tools deserve greater attention in statistics and machine learning communities. The code is available at: https://github.com/tien-endotchang/RI-variable-selection.
中文: 本文证明,相对重要性度量通过综合考虑预测变量的直接和联合效应,在变量筛选中优于边际相关性,并在处理高度相关预测变量时与套索等先进方法相媲美。
English: This paper demonstrates that relative importance measures, which account for both direct and combined predictor effects, outperform marginal correlation and are competitive with advanced methods like lasso in variable selection, particularly in handling correlated predictors.

Authors:Sangyeop Kim, Yohan Lee, Sanghwa Kim, Hyunjong Kim, Sungzoon Cho
Title: Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue
Abstract:
Effective long-term memory in conversational AI requires synthesizing information across multiple sessions. However, current systems place excessive reasoning burden on response generation, making performance significantly dependent on model sizes. We introduce PREMem (Pre-storage Reasoning for Episodic Memory), a novel approach that shifts complex reasoning processes from inference to memory construction. PREMem extracts fine-grained memory fragments categorized into factual, experiential, and subjective information; it then establishes explicit relationships between memory items across sessions, capturing evolution patterns like extensions, transformations, and implications. By performing this reasoning during pre-storage rather than when generating a response, PREMem creates enriched representations while reducing computational demands during interactions. Experiments show significant performance improvements across all model sizes, with smaller models achieving results comparable to much larger baselines while maintaining effectiveness even with constrained token budgets. Code and dataset are available at https://github.com/sangyeop-kim/PREMem.
中文: PREMem通过将复杂推理从响应生成转移到记忆构建,实现了跨会话细粒度记忆片段的分类与关联,在显著提升性能的同时有效降低了交互时的计算负担。
English: PREMem introduces a novel approach that shifts complex reasoning from response generation to memory construction by categorizing and linking fine-grained memory fragments across sessions, significantly improving performance while reducing computational demands during interactions.

Authors:Yixuan Tang, Yi Yang
Title: GAPrune: Gradient-Alignment Pruning for Domain-Aware Embeddings
Abstract:
Domain-specific embedding models have shown promise for applications that require specialized semantic understanding, such as coding agents and financial retrieval systems, often achieving higher performance gains than general models. However, state-of-the-art embedding models are typically based on LLMs, which contain billions of parameters, making deployment challenging in resource-constrained environments. Model compression through pruning offers a promising solution, but existing pruning methods treat all parameters uniformly, failing to distinguish between general semantic representations and domain-specific patterns, leading to suboptimal pruning decisions. Thus, we propose GAPrune, a pruning framework that addresses this challenge by considering both domain importance and preserving general linguistic foundation. Our method uses Fisher Information to measure importance and general-domain gradient alignment to assess parameter behavior, then combines these signals using our Domain Alignment Importance (DAI) scoring. Lower DAI scores indicate that the parameter is either less important for the domain task or creates conflicts between domain and general objectives. Experiments on two domain benchmarks, FinMTEB and ChemTEB, show that GAPrune maintains performance within 2.5% of dense models in one-shot pruning at 50% sparsity, while outperforming all baselines. With retraining in 100 steps, GAPrune achieves +4.51% improvement on FinMTEB and +1.73% on ChemTEB, demonstrating that our pruning strategy not only preserves but enhances domain-specific capabilities. Our findings demonstrate that principled pruning strategies can achieve model compression and enhanced domain specialization, providing the research community with a new approach for development.
中文: GAPrune是一种新颖的剪枝框架,通过领域对齐重要性评分对领域专用嵌入模型进行选择性压缩,在50%稀疏度下保持接近原始性能,并通过重训练进一步增强领域能力。
English: GAPrune is a novel pruning framework that uses Domain Alignment Importance scoring to selectively compress domain-specific embedding models, maintaining near-original performance at 50% sparsity while enhancing domain capabilities through retraining.

Authors:Simone Mosco, Daniel Fusaro, Wanmeng Li, Emanuele Menegatti, Alberto Pretto
Title: Point-Plane Projections for Accurate LiDAR Semantic Segmentation in Small Data Scenarios
Abstract:
LiDAR point cloud semantic segmentation is essential for interpreting 3D environments in applications such as autonomous driving and robotics. Recent methods achieve strong performance by exploiting different point cloud representations or incorporating data from other sensors, such as cameras or external datasets. However, these approaches often suffer from high computational complexity and require large amounts of training data, limiting their generalization in data-scarce scenarios. In this paper, we improve the performance of point-based methods by effectively learning features from 2D representations through point-plane projections, enabling the extraction of complementary information while relying solely on LiDAR data. Additionally, we introduce a geometry-aware technique for data augmentation that aligns with LiDAR sensor properties and mitigates class imbalance. We implemented and evaluated our method that applies point-plane projections onto multiple informative 2D representations of the point cloud. Experiments demonstrate that this approach leads to significant improvements in limited-data scenarios, while also achieving competitive results on two publicly available standard datasets, as SemanticKITTI and PandaSet. The code of our method is available at https://github.com/SiMoM0/3PNet
中文摘要:本文通过点云平面投影提取二维特征并结合几何感知数据增强技术,有效提升了激光雷达点云语义分割在有限数据场景下的性能,并在公开数据集上取得了具有竞争力的结果。
English summary: This paper enhances LiDAR point cloud semantic segmentation by using point-plane projections to extract 2D features and introducing geometry-aware data augmentation, achieving strong performance in data-limited scenarios and competitive results on standard datasets.

Authors:Eli Baum, Sam Buxbaum, Nitin Mathai, Muhammad Faisal, Vasiliki Kalavri, Mayank Varia, John Liagouris
Title: ORQ: Complex Analytics on Private Data with Strong Security Guarantees
Abstract:
We present ORQ, a system that enables collaborative analysis of large private datasets using cryptographically secure multi-party computation (MPC). ORQ protects data against semi-honest or malicious parties and can efficiently evaluate relational queries with multi-way joins and aggregations that have been considered notoriously expensive under MPC. To do so, ORQ eliminates the quadratic cost of secure joins by leveraging the fact that, in practice, the structure of many real queries allows us to join records and apply the aggregations "on the fly" while keeping the result size bounded. On the system side, ORQ contributes generic oblivious operators, a data-parallel vectorized query engine, a communication layer that amortizes MPC network costs, and a dataflow API for expressing relational analytics -- all built from the ground up. We evaluate ORQ in LAN and WAN deployments on a diverse set of workloads, including complex queries with multiple joins and custom aggregations. When compared to state-of-the-art solutions, ORQ significantly reduces MPC execution times and can process one order of magnitude larger datasets. For our most challenging workload, the full TPC-H benchmark, we report results entirely under MPC with Scale Factor 10 -- a scale that had previously been achieved only with information leakage or the use of trusted third parties.
中文: ORQ系统通过密码学安全的多方计算技术,实现了对大型私有数据集的高效协同分析,其创新性地采用实时聚合消除二次连接成本,并构建了通用 oblivious 操作符,在保持数据安全性的同时将处理性能提升了一个数量级。
English: ORQ is a secure multi-party computation system that enables efficient collaborative analysis of large private datasets by eliminating quadratic join costs through on-the-fly aggregation and introducing novel oblivious operators, significantly outperforming existing solutions in both speed and scalability.

Authors:Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, Marco Hutter
Title: RSL-RL: A Learning Library for Robotics Research
Abstract:
RSL-RL is an open-source Reinforcement Learning library tailored to the specific needs of the robotics community. Unlike broad general-purpose frameworks, its design philosophy prioritizes a compact and easily modifiable codebase, allowing researchers to adapt and extend algorithms with minimal overhead. The library focuses on algorithms most widely adopted in robotics, together with auxiliary techniques that address robotics-specific challenges. Optimized for GPU-only training, RSL-RL achieves high-throughput performance in large-scale simulation environments. Its effectiveness has been validated in both simulation benchmarks and in real-world robotic experiments, demonstrating its utility as a lightweight, extensible, and practical framework to develop learning-based robotic controllers. The library is open-sourced at: https://github.com/leggedrobotics/rsl_rl.
中文: RSL-RL是一个专为机器人学设计的开源强化学习库,具有轻量可修改的代码架构,通过GPU训练优化实现高效的学习型控制器开发。
English: RSL-RL is an open-source reinforcement learning library designed specifically for robotics, featuring a lightweight and modifiable codebase optimized for GPU training to enable efficient development of learning-based controllers.

Authors:Iman Barati, Mostafa Amiri, Heshaam Faili
Title: SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation
Abstract:
Supervised Fine-Tuning (SFT) is essential for training large language models (LLMs), significantly enhancing critical capabilities such as instruction following and in-context learning. Nevertheless, creating suitable training datasets tailored for specific domains remains challenging due to unique domain constraints and data scarcity. In this paper, we propose SearchInstruct, an innovative method explicitly designed to construct high quality instruction datasets for SFT. Our approach begins with a limited set of domain specific, human generated questions, which are systematically expanded using a large language model. Subsequently, domain relevant resources are dynamically retrieved to generate accurate and contextually appropriate answers for each augmented question. Experimental evaluation demonstrates that SearchInstruct enhances both the diversity and quality of SFT datasets, leading to measurable improvements in LLM performance within specialized domains. Additionally, we show that beyond dataset generation, the proposed method can also effectively facilitate tasks such as model editing, enabling efficient updates to existing models. To facilitate reproducibility and community adoption, we provide full implementation details, the complete set of generated instruction response pairs, and the source code in a publicly accessible Git repository: [https://github.com/mostafaamiri/SearchInstruct](https://github.com/mostafaamiri/SearchInstruct)
中文: 本文提出SearchInstruct方法,通过大语言模型扩展领域特定问题并检索相关资源生成精准答案,构建高质量监督微调指令数据集,有效提升大语言模型在专业领域的性能表现。
English: This paper introduces SearchInstruct, a novel method that constructs high-quality instruction datasets for supervised fine-tuning by expanding domain-specific questions with a large language model and retrieving relevant resources to generate accurate answers, thereby improving LLM performance in specialized domains.

Authors:Chin-Yun Yu, György Fazekas
Title: Sound Matching an Analogue Levelling Amplifier Using the Newton-Raphson Method
Abstract:
Automatic differentiation through digital signal processing algorithms for virtual analogue modelling has recently gained popularity. These algorithms are typically more computationally efficient than black-box neural networks that rely on dense matrix multiplications. Due to their differentiable nature, they can be integrated with neural networks and jointly trained using gradient descent algorithms, resulting in more efficient systems. Furthermore, signal processing algorithms have significantly fewer parameters than neural networks, allowing the application of the Newton-Raphson method. This method offers faster and more robust convergence than gradient descent at the cost of quadratic storage. This paper presents a method to emulate analogue levelling amplifiers using a feed-forward digital compressor with parameters optimised via the Newton-Raphson method. We demonstrate that a digital compressor can successfully approximate the behaviour of our target unit, the Teletronix LA-2A. Different strategies for computing the Hessian matrix are benchmarked. We leverage parallel algorithms for recursive filters to achieve efficient training on modern GPUs. The resulting model is made into a VST plugin and is open-sourced at https://github.com/aim-qmul/4a2a.
中文: 本文提出了一种利用牛顿-拉弗森方法优化的可微分数字压缩器来模拟模拟电平放大器的方法,通过GPU实现高效训练并成功复现Teletronix LA-2A的特性,最终成果作为开源VST插件发布。
English: This paper introduces a method for emulating analog leveling amplifiers using a differentiable digital compressor optimized via the Newton-Raphson method, achieving efficient GPU training and accurate approximation of the Teletronix LA-2A, with the resulting open-source VST plugin available online.

Authors:Chirayu Nimonkar, Shlok Shah, Catherine Ji, Benjamin Eysenbach
Title: Self-Supervised Goal-Reaching Results in Multi-Agent Cooperation and Exploration
Abstract:
For groups of autonomous agents to achieve a particular goal, they must engage in coordination and long-horizon reasoning. However, designing reward functions to elicit such behavior is challenging. In this paper, we study how self-supervised goal-reaching techniques can be leveraged to enable agents to cooperate. The key idea is that, rather than have agents maximize some scalar reward, agents aim to maximize the likelihood of visiting a certain goal. This problem setting enables human users to specify tasks via a single goal state rather than implementing a complex reward function. While the feedback signal is quite sparse, we will demonstrate that self-supervised goal-reaching techniques enable agents to learn from such feedback. On MARL benchmarks, our proposed method outperforms alternative approaches that have access to the same sparse reward signal as our method. While our method has no explicit mechanism for exploration, we observe that self-supervised multi-agent goal-reaching leads to emergent cooperation and exploration in settings where alternative approaches never witness a single successful trial.
中文: 通过自我监督的目标达成技术,自主智能体能够通过最大化访问指定目标状态的可能性来实现合作与长期推理,在相同稀疏奖励信号下优于其他方法,并促进探索行为的自然涌现。
English: Self-supervised goal-reaching techniques enable autonomous agents to achieve cooperation and long-horizon reasoning by maximizing the likelihood of visiting specified goal states, outperforming alternative methods with the same sparse reward signal and fostering emergent exploration.

Authors:Xiaoyang Ma, Yiyang Chai, Xinran Qu, Hong Sun
Title: USCTNet: A deep unfolding nuclear-norm optimization solver for physically consistent HSI reconstruction
Abstract:
Reconstructing hyperspectral images (HSIs) from a single RGB image is ill-posed and can become physically inconsistent when the camera spectral sensitivity (CSS) and scene illumination are misspecified. We formulate RGB-to-HSI reconstruction as a physics-grounded inverse problem regularized by a nuclear norm in a learnable transform domain, and we explicitly estimate CSS and illumination to define the forward operator embedded in each iteration, ensuring colorimetric consistency. To avoid the cost and instability of full singular-value decompositions (SVDs) required by singular-value thresholding (SVT), we introduce a data-adaptive low-rank subspace SVT operator. Building on these components, we develop USCTNet, a deep unfolding solver tailored to HSI that couples a parameter estimation module with learnable proximal updates. Extensive experiments on standard benchmarks show consistent improvements over state-of-the-art RGB-based methods in reconstruction accuracy. Code: https://github.com/psykheXX/USCTNet-Code-Implementation.git
Chinese Summary: 本研究提出USCTNet,一种通过将物理建模与可学习邻近算子相结合,从RGB图像重建高光谱图像的深度展开网络,在多个基准测试中展现出优于现有方法的精度。
English Summary: The study introduces USCTNet, a deep unfolding network that reconstructs hyperspectral images from RGB inputs by integrating physics-based modeling with learnable proximal operators, achieving superior accuracy over existing methods.

Authors:Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel
Title: Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses
Abstract:
3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly acquired in clinical settings to monitor a wide range of neurological conditions, including neurodegenerative disorders and stroke. While deep learning models have shown promising results analyzing 3D MRI across a number of brain imaging tasks, most are highly tailored for specific tasks with limited labeled data, and are not able to generalize across tasks and/or populations. The development of self-supervised learning (SSL) has enabled the creation of large medical foundation models that leverage diverse, unlabeled datasets ranging from healthy to diseased data, showing significant success in 2D medical imaging applications. However, even the very few foundation models for 3D brain MRI that have been developed remain limited in resolution, scope, or accessibility. In this work, we present a general, high-resolution SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on 18,759 patients (44,958 scans) from 11 publicly available datasets spanning diverse neurological diseases. We compare our model to Masked Autoencoders (MAE), as well as two supervised baselines, on four diverse downstream prediction tasks in both in-distribution and out-of-distribution settings. Our fine-tuned SimCLR model outperforms all other models across all tasks. Notably, our model still achieves superior performance when fine-tuned using only 20% of labeled training samples for predicting Alzheimer's disease. We use publicly available code and data, and release our trained model at https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly applicable and accessible foundation model for clinical brain MRI analysis.
中文: 本研究提出了一种基于SimCLR的高分辨率自监督基础模型,用于3D脑部MRI分析,该模型在多种任务中表现优异,即使在有限标注数据下仍保持卓越性能,并已公开共享以促进临床应用。
English: This work introduces a high-resolution, self-supervised SimCLR foundation model for 3D brain MRI, pre-trained on diverse datasets, which outperforms other models across multiple tasks and maintains strong performance with limited labeled data.

Authors:Nina Wiedemann, Dianne de Korte-de Boer, Matthias Richter, Sjors van de Weijer, Charlotte Buhre, Franz A. M. Eggert, Sophie Aarnoudse, Lotte Grevendonk, Steffen Röber, Carlijn M. E. Remie, Wolfgang Buhre, Ronald Henry, Jannis Born
Title: COVID-BLUeS -- A Prospective Study on the Value of AI in Lung Ultrasound Analysis
Abstract:
As a lightweight and non-invasive imaging technique, lung ultrasound (LUS) has gained importance for assessing lung pathologies. The use of Artificial intelligence (AI) in medical decision support systems is promising due to the time- and expertise-intensive interpretation, however, due to the poor quality of existing data used for training AI models, their usability for real-world applications remains unclear. In a prospective study, we analyze data from 63 COVID-19 suspects (33 positive) collected at Maastricht University Medical Centre. Ultrasound recordings at six body locations were acquired following the BLUE protocol and manually labeled for severity of lung involvement. Several AI models were applied and trained for detection and severity of pulmonary infection. The severity of the lung infection, as assigned by human annotators based on the LUS videos, is not significantly different between COVID-19 positive and negative patients (p = 0.89). Nevertheless, the predictions of image-based AI models identify a COVID-19 infection with 65% accuracy when applied zero-shot (i.e., trained on other datasets), and up to 79% with targeted training, whereas the accuracy based on human annotations is at most 65%. Multi-modal models combining images and CBC improve significantly over image-only models. Although our analysis generally supports the value of AI in LUS assessment, the evaluated models fall short of the performance expected from previous work. We find this is due to 1) the heterogeneity of LUS datasets, limiting the generalization ability to new data, 2) the frame-based processing of AI models ignoring video-level information, and 3) lack of work on multi-modal models that can extract the most relevant information from video-, image- and variable-based inputs. To aid future research, we publish the dataset at: https://github.com/NinaWie/COVID-BLUES.
Chinese: 肺部超声结合人工智能在评估肺部病变方面具有潜力,但由于数据集异质性和视频信息利用不足,现有模型在现实应用中受限,经针对性训练后对COVID-19检测的最高准确率可达79%。
English: Lung ultrasound combined with AI shows promise for assessing lung pathologies, but current models face limitations in real-world application due to dataset heterogeneity and insufficient video-level analysis, achieving up to 79% accuracy for COVID-19 detection with targeted training.

Authors:Miaoge Li, Yang Chen, Zhijie Rao, Can Jiang, Jingcai Guo
Title: Semantic-guided LoRA Parameters Generation
Abstract:
Low-Rank Adaptation (LoRA) has demonstrated strong generalization capabilities across a variety of tasks for efficiently fine-tuning AI models, especially on resource-constrained edges. However, in real-world applications, edge users often exhibit task-specific preferences that are difficult to handle with a unified model trained under a closed-world assumption, and the challenge may further increase when there are significant domain shifts between training and deployment. Meanwhile, retraining/fine-tuning models for each user is also impractical due to its cost-intensive nature and privacy concerns over raw data utilization from edges. To address these challenges, we propose Semantic-guided LoRA Parameter Generation (SG-LoRA), the first of its kind framework to efficiently produce user-specific LoRA parameters without any additional training on user tasks or access to user-specific data. Concretely, SG-LoRA uses task descriptions as the semantic bridge, measuring their proximity to a set of known expert tasks in a shared embedding space. Based on this semantic guidance, it models the target task's LoRA parameter distribution to generate high-performing parameters for novel tasks. SG-LoRA enables the real-time construction of LoRA models aligned with individual intents by distilling knowledge from prominent LoRA experts and, meanwhile, offering a privacy-preserving solution for personalized model adaptation in a novel zero-shot open-world setting proposed in this work. Extensive experiments on multiple challenging tasks confirm the superior performance and remarkable adaptability of SG-LoRA. Code is available at https://github.com/keepgoingjkg/SG-LoRA.
中文: SG-LoRA提出了一种创新框架,通过利用语义任务描述和专家知识,以零样本方式为边缘用户生成个性化的LoRA参数,无需额外训练或访问用户数据即可实现高效且保护隐私的模型适配。
English: SG-LoRA introduces a novel framework that generates personalized LoRA parameters for edge users in a zero-shot manner by leveraging semantic task descriptions and expert knowledge, enabling efficient and privacy-preserving model adaptation without additional training or access to user data.

Authors:Amirhossein Ghaffari, Huong Nguyen, Lauri Lovén, Ekaterina Gilman
Title: STM-Graph: A Python Framework for Spatio-Temporal Mapping and Graph Neural Network Predictions
Abstract:
Urban spatio-temporal data present unique challenges for predictive analytics due to their dynamic and complex nature. We introduce STM-Graph, an open-source Python framework that transforms raw spatio-temporal urban event data into graph representations suitable for Graph Neural Network (GNN) training and prediction. STM-Graph integrates diverse spatial mapping methods, urban features from OpenStreetMap, multiple GNN models, comprehensive visualization tools, and a graphical user interface (GUI) suitable for professional and non-professional users. This modular and extensible framework facilitates rapid experimentation and benchmarking. It allows integration of new mapping methods and custom models, making it a valuable resource for researchers and practitioners in urban computing. The source code of the framework and GUI are available at: https://github.com/Ahghaffari/stm_graph and https://github.com/tuminguyen/stm_graph_gui.
中文:STM-Graph是一个开源Python框架,可将城市时空数据转化为适用于图神经网络训练的图结构,其模块化设计、可视化工具和图形界面为城市计算领域的研究者和实践者提供了便捷支持。
English: STM-Graph is an open-source Python framework that converts urban spatio-temporal data into graph representations for GNN training, featuring modular design, visualization tools, and a GUI to support both researchers and practitioners in urban computing.

Authors:Prajit Sengupta, Islem Rekik
Title: FireGNN: Neuro-Symbolic Graph Neural Networks with Trainable Fuzzy Rules for Interpretable Medical Image Classification
Abstract:
Medical image classification requires not only high predictive performance but also interpretability to ensure clinical trust and adoption. Graph Neural Networks (GNNs) offer a powerful framework for modeling relational structures within datasets; however, standard GNNs often operate as black boxes, limiting transparency and usability, particularly in clinical settings. In this work, we present an interpretable graph-based learning framework named FireGNN that integrates trainable fuzzy rules into GNNs for medical image classification. These rules embed topological descriptors - node degree, clustering coefficient, and label agreement - using learnable thresholds and sharpness parameters to enable intrinsic symbolic reasoning. Additionally, we explore auxiliary self-supervised tasks (e.g., homophily prediction, similarity entropy) as a benchmark to evaluate the contribution of topological learning. Our fuzzy-rule-enhanced model achieves strong performance across five MedMNIST benchmarks and the synthetic dataset MorphoMNIST, while also generating interpretable rule-based explanations. To our knowledge, this is the first integration of trainable fuzzy rules within a GNN. Source Code: https://github.com/basiralab/FireGNN
中文摘要:FireGNN框架将可训练的模糊规则与图神经网络相结合,通过拓扑描述符实现符号推理,在提升医学图像分类性能的同时生成可解释的规则说明。
English Summary: The FireGNN framework integrates trainable fuzzy rules with Graph Neural Networks to enhance interpretability in medical image classification, achieving strong performance across multiple benchmarks while providing rule-based explanations.

Authors:Sai Teja Reddy Adapala
Title: The Anti-Ouroboros Effect: Emergent Resilience in Large Language Models from Recursive Selective Feedback
Abstract:
The stability of recursively trained large language models (LLMs) is a foundational problem for AI safety. Prevailing theory predicts model collapse, a progressive degradation when models are trained on their own output. We challenge this narrative by introducing a selective feedback mechanism. Contrary to expectation, instead of merely slowing decay, our experiments provide strong evidence that this pressure reverses it, inducing a statistically significant performance improvement in a Gemma 2B model on a complex summarization task. We name this phenomenon the Anti-Ouroboros Effect. We contrast this with a foundational experiment using a simple classifier, where the theoretical degenerative loop was validated, highlighting the unique dynamics of high-dimensional models. Our findings establish that systemic resilience can be an emergent property of LLMs under simple selection pressure, suggesting a powerful and scalable principle for developing safer and more robust AI systems. Across five generations, a quality-filtered condition improved by 6.6% in ROUGE-L F1 score, whereas an unfiltered control degraded by 3.5% and a random-filter control degraded by 4.2%
Chinese: 引入选择性反馈机制可逆转大语言模型的性能衰退,产生名为"反噬尾效应"的显著性能提升,证明在筛选压力下系统韧性可作为涌现属性出现。
English: Introducing a selective feedback mechanism reverses model degradation in LLMs, inducing significant performance improvement termed the Anti-Ouroboros Effect, demonstrating emergent systemic resilience under selection pressure.

Authors:Ning Yang, Junrui Wen, Meng Zhang, Ming Tang
Title: Generalizable Pareto-Optimal Offloading with Reinforcement Learning in Mobile Edge Computing
Abstract:
Mobile edge computing (MEC) is essential for next-generation mobile network applications that prioritize various performance metrics, including delays and energy efficiency. However, conventional single-objective scheduling solutions cannot be directly applied to practical systems in which the preferences (i.e., the weights of different objectives) are often unknown or challenging to specify in advance. In this study, we formulate a multi-objective offloading problem for MEC with multiple edges to minimize the sum of expected long-term energy consumption and delay while considering unknown preferences. To address the challenge of unknown preferences and the potentially diverse MEC systems, we propose a generalizable multi-objective (deep) reinforcement learning (GMORL)-based tasks offloading framework, which employs the Discrete Soft Actor-Critic (Discrete-SAC) method. Our method uses a single policy model to efficiently schedule tasks based on varying preferences and adapt to heterogeneous MEC systems with different CPU frequencies and server quantities. Under the proposed framework, we introduce a histogram-based state encoding method for constructing features for multiple edges in MEC systems, a sophisticated reward function for accurately computing the utilities of delay and energy consumption, and a novel neural network architecture for improving generalization. Simulation results demonstrate that our proposed GMORL scheme enhances the hypervolume of the Pareto front by up to $121.0\%$ compared to benchmarks. Our code are avavilable at https://github.com/gracefulning/Generalizable-Pareto-Optimal-Offloading-with-Reinforcement-Learning-in-Mobile-Edge-Computing
Chinese: 本研究提出了一种基于离散SAC的GMORL框架,用于在偏好未知的移动边缘计算系统中优化多目标任务卸载,相比基准方法将帕累托前沿超体积提升了最高达121.0%。
English: This study introduces a GMORL framework using Discrete-SAC to optimize multi-objective task offloading in MEC systems with unknown preferences, achieving up to 121.0% improvement in Pareto front hypervolume over benchmarks.

Authors:Christian Fane
Title: A Real-Time Diminished Reality Approach to Privacy in MR Collaboration
Abstract:
Diminished reality (DR) refers to the digital removal of real-world objects by compositing background content in their place. This thesis presents a real-time, inpainting-based DR system designed to enable privacy control in shared-space mixed reality (MR) meetings. The system allows a primary headset user to selectively remove personal or sensitive items from their environment, ensuring that those objects are no longer visible to other participants. Removal is achieved through semantic segmentation and precise object selection, followed by real-time inpainting from the viewpoint of a secondary observer, implemented using a mobile ZED 2i depth camera. The solution is designed to be portable and robust, requiring neither a fixed secondary viewpoint nor prior 3D scanning of the environment. The system utilises YOLOv11 for object detection and a modified Decoupled Spatial-Temporal Transformer (DSTT) model for high-quality video inpainting. At 720p resolution, the pipeline sustains frame rates exceeding 20 fps, demonstrating the feasibility of real-time diminished reality for practical privacy-preserving MR applications.
本论文提出了一种实时减实系统,通过语义分割和视频修复技术选择性地移除混合现实会议中的敏感物体,在720p分辨率下帧率超过20 fps,实现了隐私保护功能。
This thesis introduces a real-time diminished reality system that uses semantic segmentation and video inpainting to selectively remove sensitive objects from mixed reality meetings, achieving over 20 fps at 720p for privacy protection.

Authors:Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel
Title: SSL-AD: Spatiotemporal Self-Supervised Learning for Generalizability and Adaptability Across Alzheimer's Prediction Tasks and Datasets
Abstract:
Alzheimer's disease is a progressive, neurodegenerative disorder that causes memory loss and cognitive decline. While there has been extensive research in applying deep learning models to Alzheimer's prediction tasks, these models remain limited by lack of available labeled data, poor generalization across datasets, and inflexibility to varying numbers of input scans and time intervals between scans. In this study, we adapt three state-of-the-art temporal self-supervised learning (SSL) approaches for 3D brain MRI analysis, and add novel extensions designed to handle variable-length inputs and learn robust spatial features. We aggregate four publicly available datasets comprising 3,161 patients for pre-training, and show the performance of our model across multiple Alzheimer's prediction tasks including diagnosis classification, conversion detection, and future conversion prediction. Importantly, our SSL model implemented with temporal order prediction and contrastive learning outperforms supervised learning on six out of seven downstream tasks. It demonstrates adaptability and generalizability across tasks and number of input images with varying time intervals, highlighting its capacity for robust performance across clinical applications. We release our code and model publicly at https://github.com/emilykaczmarek/SSL-AD.
中文: 本研究采用时序自监督学习方法分析三维脑部核磁共振影像,通过处理可变输入和扫描间隔,在多项阿尔茨海默病预测任务中展现出优于监督学习的性能与跨数据集泛化能力。
English: This study adapts temporal self-supervised learning approaches for 3D brain MRI analysis to overcome limitations in Alzheimer's prediction, demonstrating superior performance over supervised methods across multiple tasks while handling variable inputs and time intervals.

Authors:Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
Title: DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Abstract:
Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs' long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
中文: DeepDive通过从开放知识图谱自动合成复杂问题并应用端到端多轮强化学习,提升了大型语言模型的深度搜索能力,在多个基准测试中取得领先性能。
English: DeepDive enhances large language models' deep search capabilities by synthesizing complex questions from knowledge graphs and applying multi-turn reinforcement learning, achieving competitive results and improved reasoning.

Authors:Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry
Title: TASC: Task-Aware Shared Control for Teleoperated Manipulation
Abstract:
We present TASC, a Task-Aware Shared Control framework for teleoperated manipulation that infers task-level user intent and provides assistance throughout the task. To support everyday tasks without predefined knowledge, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides rotation assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in general-purpose, long-horizon shared control: (1) understanding and inferring task-level user intent, and (2) generalizing assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods. To the best of our knowledge, this is the first shared control framework that supports everyday manipulation tasks with zero-shot generalization. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.
中文: TASC是一种任务感知共享控制框架,通过开放词汇交互图推断用户意图并在操作任务中提供旋转辅助,实现了零样本泛化并显著提升了任务效率。
English: TASC is a task-aware shared control framework that infers user intent through open-vocabulary interaction graphs and provides rotation assistance during manipulation tasks, improving efficiency with zero-shot generalization across diverse scenarios.

Authors:Iacopo Curti, Pierluigi Zama Ramirez, Alioscia Petrelli, Luigi Di Stefano
Title: Multimodal SAM-adapter for Semantic Segmentation
Abstract:
Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM's rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal-SAM-Adapter.
中文摘要:MM SAM-adapter框架通过适配器网络将辅助传感器数据与RGB特征相融合,增强了多模态语义分割的鲁棒性,在多种复杂条件下均实现了最优性能。
English Summary: The MM SAM-adapter framework enhances multimodal semantic segmentation by integrating auxiliary sensor data with RGB features through an adapter network, achieving state-of-the-art robustness across diverse challenging conditions.

Authors:Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Zhiyuan Ning, Yue Zhang
Title: Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
Abstract:
Failure attribution in multi-agent systems -- pinpointing the exact step where a decisive error occurs -- is a critical yet unsolved challenge. Current methods treat this as a pattern recognition task over long conversation logs, leading to critically low step-level accuracy (below 17\%), which renders them impractical for debugging complex systems. Their core weakness is a fundamental inability to perform robust counterfactual reasoning: to determine if correcting a single action would have actually averted the task failure. To bridge this \emph{counterfactual inference gap}, we introduce Abduct-Act-Predict (A2P) Scaffolding, a novel agent framework that transforms failure attribution from pattern recognition into a structured causal inference task. A2P explicitly guides a large language model through a formal three-step reasoning process within a single inference pass: (1) Abduction, to infer the hidden root causes behind an agent's actions; (2) Action, to define a minimal corrective intervention; and (3) Prediction, to simulate the subsequent trajectory and verify if the intervention resolves the failure. This structured approach leverages the holistic context of the entire conversation while imposing a rigorous causal logic on the model's analysis. Our extensive experiments on the Who\&When benchmark demonstrate its efficacy. On the Algorithm-Generated dataset, A2P achieves 47.46\% step-level accuracy, a 2.85$\times$ improvement over the 16.67\% of the baseline. On the more complex Hand-Crafted dataset, it achieves 29.31\% step accuracy, a 2.43$\times$ improvement over the baseline's 12.07\%. By reframing the problem through a causal lens, A2P Scaffolding provides a robust, verifiable, and significantly more accurate solution for automated failure attribution. Ours code are released at https://github.com/ResearAI/A2P.
中文摘要:A2P框架通过将失败归因转化为结构化因果推理任务,指导语言模型执行溯因-行动-预测的三步推理,在基准测试中实现了最高2.85倍的步骤级准确率提升。
English Summary: The A2P Scaffolding framework transforms failure attribution from pattern recognition into structured causal inference, achieving up to 2.85× accuracy improvement by guiding language models through abductive reasoning about root causes and counterfactual interventions.

Authors:Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan
Title: Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Abstract:
The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper, we present a comprehensive characterization of LLM training across diverse real-world workloads and hardware platforms, including NVIDIA H100/H200 and AMD MI250 GPUs. We analyze dense and sparse models under various parallelism strategies -- tensor, pipeline, data, and expert -- and evaluate their effects on hardware utilization, power consumption, and thermal behavior. We further evaluate the effectiveness of optimizations such as activation recomputation and compute-communication overlap. Our findings show that performance is not determined solely by scaling hardware capacity. Scale-up systems with fewer, higher-memory GPUs can outperform scale-out systems in communication-bound regimes, but only under carefully tuned configurations; in other cases, scale-out deployments achieve superior throughput. We also show that certain parallelism combinations, such as tensor with pipeline, lead to bandwidth underutilization due to inefficient data chunking, while increasing microbatch sizes beyond a certain point induces bursty execution and peak power excursions that worsen thermal throttling. These insights reveal how training performance is shaped by complex interactions between hardware, system topology, and model execution. We conclude by offering recommendations for system and hardware design to improve the scalability and reliability of future LLM systems and workloads. The source code of this project is available at https://github.com/sitar-lab/CharLLM-PPT.
中文摘要:研究表明,大语言模型训练性能取决于硬件、系统拓扑和模型执行之间的复杂交互,在优化配置下,纵向扩展系统在通信受限场景中有时优于横向扩展部署。
English Summary: The study reveals that LLM training performance depends on complex interactions between hardware, system topology, and model execution, with scale-up systems sometimes outperforming scale-out configurations in communication-bound scenarios under optimized settings.

Authors:Fabien Allemand, Attilio Fiandrotti, Sumanta Chaudhuri, Alaa Eddine Mazouz
Title: Efficient Learned Image Compression Through Knowledge Distillation
Abstract:
Learned image compression sits at the intersection of machine learning and image processing. With advances in deep learning, neural network-based compression methods have emerged. In this process, an encoder maps the image to a low-dimensional latent space, which is then quantized, entropy-coded into a binary bitstream, and transmitted to the receiver. At the receiver end, the bitstream is entropy-decoded, and a decoder reconstructs an approximation of the original image. Recent research suggests that these models consistently outperform conventional codecs. However, they require significant processing power, making them unsuitable for real-time use on resource-constrained platforms, which hinders their deployment in mainstream applications. This study aims to reduce the resource requirements of neural networks used for image compression by leveraging knowledge distillation, a training paradigm where smaller neural networks, partially trained on the outputs of larger, more complex models, can achieve better performance than when trained independently. Our work demonstrates that knowledge distillation can be effectively applied to image compression tasks: i) across various architecture sizes, ii) to achieve different image quality/bit rate tradeoffs, and iii) to save processing and energy resources. This approach introduces new settings and hyperparameters, and future research could explore the impact of different teacher models, as well as alternative loss functions. Knowledge distillation could also be extended to transformer-based models. The code is publicly available at: https://github.com/FABallemand/PRIM .
中文: 本研究采用知识蒸馏技术优化基于神经网络的图像压缩方法,在降低计算资源消耗的同时,通过在不同架构与质量-码率权衡中保持性能,有效提升了模型实用性。
English: This study employs knowledge distillation to enhance neural network-based image compression by reducing computational demands while maintaining performance across various architectures and quality-bitrate tradeoffs.

Authors:Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang
Title: Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
Abstract:
Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverage the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective for previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates their information to all the other tokens, which is able to reduce the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.
中文: 本文提出聚类驱动特征缓存(ClusCa)方法,通过对令牌进行空间聚类将计算量减少90%以上,无需重新训练即可显著提升文生图和文生视频的生成速度。
English: This paper introduces Cluster-Driven Feature Caching (ClusCa), a method that accelerates diffusion transformers by performing spatial clustering to reduce token computations by over 90%, achieving significant speed improvements in text-to-image and text-to-video generation without requiring retraining.

Authors:Marco Artiano, Oswald Knoth, Peter Spichtinger, Hendrik Ranocha
Title: Structure-Preserving High-Order Methods for the Compressible Euler Equations in Potential Temperature Formulation for Atmospheric Flows
Abstract:
We develop structure-preserving numerical methods for the compressible Euler equations, employing potential temperature as a prognostic variable. We construct three numerical fluxes designed to ensure the conservation of entropy and total energy within the discontinuous Galerkin framework on general curvilinear meshes. Furthermore, we introduce a generalization for the kinetic energy preservation property and total energy conservation in the presence of a gravitational potential term. To this end, we adopt a flux-differencing approach for the discretization of the source term, treated as non-conservative product. We present well-balanced schemes for different constant background states for both formulations (total energy and potential temperature) on curvilinear meshes. Finally, we validate the methods by comparing the potential temperature formulation with the traditional Euler equations formulation across a range of classical atmospheric scenarios.
中文摘要:本文针对可压缩欧拉方程开发了以位温为预报变量的保结构数值方法,提出了保持熵和总能量守恒的数值通量,并在多种大气场景中验证了与传统欧拉方程相比的优越性。
English Summary: This paper develops structure-preserving numerical methods for compressible Euler equations using potential temperature as a prognostic variable, featuring entropy-conserving fluxes and well-balanced schemes validated across atmospheric scenarios.

Authors:Evan Murphy, Marco Viola, Vladimir A. Krylov
Title: A Stochastic Birth-and-Death Approach for Street Furniture Geolocation in Urban Environments
Abstract:
In this paper we address the problem of precise geolocation of street furniture in complex urban environments, which is a critical task for effective monitoring and maintenance of public infrastructure by local authorities and private stakeholders. To this end, we propose a probabilistic framework based on energy maps that encode the spatial likelihood of object locations. Representing the energy in a map-based geopositioned format allows the optimisation process to seamlessly integrate external geospatial information, such as GIS layers, road maps, or placement constraints, which improves contextual awareness and localisation accuracy. A stochastic birth-and-death optimisation algorithm is introduced to infer the most probable configuration of assets. We evaluate our approach using a realistic simulation informed by a geolocated dataset of street lighting infrastructure in Dublin city centre, demonstrating its potential for scalable and accurate urban asset mapping. The implementation of the algorithm will be made available in the GitHub repository https://github.com/EMurphy0108/SBD_Street_Furniture.
中文: 本文提出了一种基于能量映射的概率框架和随机生死算法,用于在城市环境中精确定位街道设施,通过整合外部地理空间数据提高了定位准确性,并利用都柏林街灯基础设施的模拟数据验证了其有效性。
English: This paper introduces a probabilistic framework using energy maps and a stochastic birth-and-death algorithm to accurately geolocate street furniture in urban settings, enhancing localization through integration with external geospatial data and demonstrating effectiveness via simulations with Dublin's street lighting infrastructure.

Authors:Joshua Dimasaka, Christian Geiß, Robert Muir-Wood, Emily So
Title: GraphCSVAE: Graph Categorical Structured Variational Autoencoder for Spatiotemporal Auditing of Physical Vulnerability Towards Sustainable Post-Disaster Risk Reduction
Abstract:
In the aftermath of disasters, many institutions worldwide face challenges in continually monitoring changes in disaster risk, limiting the ability of key decision-makers to assess progress towards the UN Sendai Framework for Disaster Risk Reduction 2015-2030. While numerous efforts have substantially advanced the large-scale modeling of hazard and exposure through Earth observation and data-driven methods, progress remains limited in modeling another equally important yet challenging element of the risk equation: physical vulnerability. To address this gap, we introduce Graph Categorical Structured Variational Autoencoder (GraphCSVAE), a novel probabilistic data-driven framework for modeling physical vulnerability by integrating deep learning, graph representation, and categorical probabilistic inference, using time-series satellite-derived datasets and prior expert belief systems. We introduce a weakly supervised first-order transition matrix that reflects the changes in the spatiotemporal distribution of physical vulnerability in two disaster-stricken and socioeconomically disadvantaged areas: (1) the cyclone-impacted coastal Khurushkul community in Bangladesh and (2) the mudslide-affected city of Freetown in Sierra Leone. Our work reveals post-disaster regional dynamics in physical vulnerability, offering valuable insights into localized spatiotemporal auditing and sustainable strategies for post-disaster risk reduction.
中文: 全球机构在监测灾害风险变化上面临挑战,影响联合国仙台框架进展评估,为此我们提出GraphCSVAE框架,结合深度学习和卫星数据建模物理脆弱性,揭示受灾地区灾后动态,为风险减缓提供策略洞见。
English: Global institutions struggle to monitor disaster risk changes effectively, hindering progress assessment under the UN Sendai Framework, prompting the development of GraphCSVAE, a novel framework that models physical vulnerability using deep learning and satellite data to reveal post-disaster dynamics in vulnerable regions.

Authors:Zeyneddin Oz, Jonas Knoche, Alireza Yazdani, Bernd Engel, Kristof Van Laerhoven
Title: TubeBEND: A Real-World Dataset for Geometry Prediction in Rotary Draw Bending
Abstract:
This paper presents TubeBEND, a real-world dataset comprising 318 rotary tube bending processes, which were collected and sorted by experts from various fields to evaluate machine learning and signal analysis methods. The dataset addresses the industrial challenge of predicting the geometry of a first-stage bend, which can be beneficial for designing machine clamping molds for the second-stage bend in two-stage rotary draw bending. Some geometry criteria, such as the tube's final bent angle (or springback) and its cross-sectional deformation, are being recorded in this dataset. This dataset gives us the possibility to build and test machine learning models that can predict the geometry and help the machine operators with a better machine setup to optimize the tube's springback and deformation. Moreover, by recording some process parameters, such as tool movements and forces or torques applied to them, we deliver detailed information about their impacts on the final tube geometry. The focus of our work is to discover solutions that can replace traditional methods, such as trial-and-error or simulation-based predictions, by including experimental process variables in ML algorithms. Our dataset is publicly available at https://github.com/zeyneddinoz/tubebend and https://zenodo.org/records/16614082 as a benchmark to improve data-driven methods in this field.
中文: TubeBEND是一个包含318个旋转弯管过程的公开数据集,旨在通过机器学习模型预测管材几何形状并优化回弹和变形,以替代传统的试错方法。
English: TubeBEND is a publicly available dataset of 318 rotary tube bending processes designed to enable machine learning models to predict tube geometry and optimize springback and deformation, replacing traditional trial-and-error methods.

Authors:Xinhong Zhang, Runqing Wang, Yunfan Ren, Jian Sun, Hao Fang, Jie Chen, Gang Wang
Title: DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning
Abstract:
This letter introduces DiffAero, a lightweight, GPU-accelerated, and fully differentiable simulation framework designed for efficient quadrotor control policy learning. DiffAero supports both environment-level and agent-level parallelism and integrates multiple dynamics models, customizable sensor stacks (IMU, depth camera, and LiDAR), and diverse flight tasks within a unified, GPU-native training interface. By fully parallelizing both physics and rendering on the GPU, DiffAero eliminates CPU-GPU data transfer bottlenecks and delivers orders-of-magnitude improvements in simulation throughput. In contrast to existing simulators, DiffAero not only provides high-performance simulation but also serves as a research platform for exploring differentiable and hybrid learning algorithms. Extensive benchmarks and real-world flight experiments demonstrate that DiffAero and hybrid learning algorithms combined can learn robust flight policies in hours on consumer-grade hardware. The code is available at https://github.com/flyingbitac/diffaero.
中文:DiffAero是一个完全可微的GPU加速仿真框架,通过高度并行化和集成多种动力学模型,能在消费级硬件上数小时内完成鲁棒的四旋翼控制策略学习。
English: DiffAero is a fully differentiable, GPU-accelerated simulation framework that enables efficient quadrotor control policy learning with high parallelism and integrated dynamics models, achieving robust policy training in hours on consumer hardware.

Authors:Shiwei Li, Qunwei Li, Haozhao Wang, Ruixuan Li, Jianbin Lin, Wenliang Zhong
Title: FedBiF: Communication-Efficient Federated Learning via Bits Freezing
Abstract:
Federated learning (FL) is an emerging distributed machine learning paradigm that enables collaborative model training without sharing local data. Despite its advantages, FL suffers from substantial communication overhead, which can affect training efficiency. Recent efforts have mitigated this issue by quantizing model updates to reduce communication costs. However, most existing methods apply quantization only after local training, introducing quantization errors into the trained parameters and potentially degrading model accuracy. In this paper, we propose Federated Bit Freezing (FedBiF), a novel FL framework that directly learns quantized model parameters during local training. In each communication round, the server first quantizes the model parameters and transmits them to the clients. FedBiF then allows each client to update only a single bit of the multi-bit parameter representation, freezing the remaining bits. This bit-by-bit update strategy reduces each parameter update to one bit while maintaining high precision in parameter representation. Extensive experiments are conducted on five widely used datasets under both IID and Non-IID settings. The results demonstrate that FedBiF not only achieves superior communication compression but also promotes sparsity in the resulting models. Notably, FedBiF attains accuracy comparable to FedAvg, even when using only 1 bit-per-parameter (bpp) for uplink and 3 bpp for downlink communication. The code is available at https://github.com/Leopold1423/fedbif-tpds25.
中文: 本文提出联邦比特冻结(FedBiF)这一新颖联邦学习框架,通过在本地训练中直接学习量化参数并每次仅更新单个比特,实现了高压缩比和与FedAvg相当的精度,同时大幅降低了通信开销。
English: This paper introduces Federated Bit Freezing (FedBiF), a novel federated learning framework that directly learns quantized parameters during local training by updating only one bit per parameter, achieving high compression and accuracy comparable to FedAvg with minimal communication costs.

Authors:Minsang Kong, Myeongjun Kim, Sang Gu Kang, Sang Hun Lee
Title: BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals
Abstract:
In autonomous driving, trajectory prediction is essential for ensuring safe and efficient navigation. To improve prediction accuracy, recent approaches often rely on pre-built high-definition (HD) maps or real-time local map construction modules to incorporate static environmental information. However, pre-built HD maps are limited to specific regions and cannot adapt to transient changes. In addition, local map construction modules, which recognize only predefined elements, may fail to capture critical scene details or introduce errors that degrade prediction performance. To overcome these limitations, we propose Bird's-Eye View Trajectory Prediction (BEVTraj), a novel trajectory prediction framework that operates directly in the bird's-eye view (BEV) space utilizing real-time sensor data without relying on any pre-built maps. The BEVTraj leverages deformable attention to efficiently extract relevant context from dense BEV features. Furthermore, we introduce a Sparse Goal Candidate Proposal (SGCP) module, which enables full end-to-end prediction without requiring any post-processing steps. Extensive experiments demonstrate that the BEVTraj achieves performance comparable to state-of-the-art HD map-based models while offering greater flexibility by eliminating the dependency on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.
中文: BEVTraj是一种新型自动驾驶轨迹预测框架,直接在鸟瞰图空间使用实时传感器数据,通过可变形注意力和稀疏目标候选提议模块,在摆脱预建高精地图依赖的同时实现了可比性能。
English: BEVTraj is a novel autonomous driving trajectory prediction framework that uses real-time sensor data in bird's-eye view space, eliminating dependency on pre-built HD maps while achieving comparable performance through deformable attention and sparse goal candidate proposal modules.

Authors:Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li, Yiping Ke, Xue Jiang, Wayne Zhang
Title: Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
Abstract:
Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code, and datasets will be released at https://github.com/VisionXLab/avi-math
中文: AVI-Math基准测试首次评估无人机图像中的多模态数学推理能力,发现现有视觉语言模型在此领域存在明显不足,为未来研究指明了方向。
English: The AVI-Math benchmark is introduced to evaluate multimodal mathematical reasoning in UAV imagery, revealing that current vision-language models struggle with these complex tasks despite their broader successes.

Authors:Hailong Yang, Mingxian Gu, Jianqi Wang, Guanjin Wang, Zhaohong Deng
Title: XAgents: A Unified Framework for Multi-Agent Cooperation via IF-THEN Rules and Multipolar Task Processing Graph
Abstract:
The rapid advancement of Large Language Models (LLMs) has significantly enhanced the capabilities of Multi-Agent Systems (MAS) in supporting humans with complex, real-world tasks. However, MAS still face challenges in effective task planning when handling highly complex tasks with uncertainty, often resulting in misleading or incorrect outputs that hinder task execution. To address this, we propose XAgents, a unified multi-agent cooperative framework built on a multipolar task processing graph and IF-THEN rules. XAgents uses the multipolar task processing graph to enable dynamic task planning and handle task uncertainty. During subtask processing, it integrates domain-specific IF-THEN rules to constrain agent behaviors, while global rules enhance inter-agent collaboration. We evaluate the performance of XAgents across three distinct datasets, demonstrating that it consistently surpasses state-of-the-art single-agent and multi-agent approaches in both knowledge-typed and logic-typed question-answering tasks. The codes for XAgents are available at: https://github.com/AGI-FHBC/XAgents.
Chinese: XAgents是一个统一的多智能体协作框架,通过多极任务处理图和IF-THEN规则改进任务规划并处理不确定性,在知识和逻辑型问答任务中持续超越现有最优方法。
English: XAgents is a unified multi-agent cooperative framework that enhances task planning and handles uncertainty through a multipolar task processing graph and IF-THEN rules, consistently outperforming state-of-the-art approaches in knowledge-typed and logic-typed question-answering tasks.

Authors:Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Joey Tianyi Zhou, Jianshu Li
Title: LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Abstract:
As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce \textbf{LaV-CoT}, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2$\times$ larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: \href{https://github.com/HJNVR/LaV-CoT}
中文: LaV-CoT框架通过语言感知的视觉思维链和多方面奖励优化,在多语言视觉问答任务中显著超越现有模型,实现了高达9.5%的准确率提升。
English: The LaV-CoT framework introduces a language-aware visual chain-of-thought approach with multi-aspect reward optimization, achieving significant accuracy improvements over existing models in multilingual visual question answering tasks.

Authors:Xiaodong Guo, Tong Liu, Yike Li, Zi'ang Lin, Zhihong Deng
Title: TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion
Abstract:
RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromises the model's real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder's capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.
中文: TUNI提出了一种统一的RGB-T编码器,集成特征提取与融合,实现高效语义分割,在降低复杂度的同时保持竞争力并具备实时推理能力。
English: TUNI introduces a unified RGB-T encoder that integrates feature extraction and fusion for efficient semantic segmentation, achieving competitive performance with reduced complexity and real-time inference.

Authors:Siying Liu, Zikai Wang, Hanle Zheng, Yifan Hu, Xilin Wang, Qingkai Yang, Jibin Wu, Hao Guo, Lei Deng
Title: ISTASTrack: Bridging ANN and SNN via ISTA Adapter for RGB-Event Tracking
Abstract:
RGB-Event tracking has become a promising trend in visual object tracking to leverage the complementary strengths of both RGB images and dynamic spike events for improved performance. However, existing artificial neural networks (ANNs) struggle to fully exploit the sparse and asynchronous nature of event streams. Recent efforts toward hybrid architectures combining ANNs and spiking neural networks (SNNs) have emerged as a promising solution in RGB-Event perception, yet effectively fusing features across heterogeneous paradigms remains a challenge. In this work, we propose ISTASTrack, the first transformer-based \textbf{A}NN-\textbf{S}NN hybrid \textbf{Track}er equipped with \textbf{ISTA} adapters for RGB-Event tracking. The two-branch model employs a vision transformer to extract spatial context from RGB inputs and a spiking transformer to capture spatio-temporal dynamics from event streams. To bridge the modality and paradigm gap between ANN and SNN features, we systematically design a model-based ISTA adapter for bidirectional feature interaction between the two branches, derived from sparse representation theory by unfolding the iterative shrinkage thresholding algorithm. Additionally, we incorporate a temporal downsampling attention module within the adapter to align multi-step SNN features with single-step ANN features in the latent space, improving temporal fusion. Experimental results on RGB-Event tracking benchmarks, such as FE240hz, VisEvent, COESOT, and FELT, have demonstrated that ISTASTrack achieves state-of-the-art performance while maintaining high energy efficiency, highlighting the effectiveness and practicality of hybrid ANN-SNN designs for robust visual tracking. The code is publicly available at https://github.com/lsying009/ISTASTrack.git.
中文摘要:ISTASTrack是一种基于Transformer的混合跟踪器,通过ISTA适配器融合人工神经网络和脉冲神经网络分支,在多个RGB-事件跟踪基准上实现了最优性能,同时保持高能效。
English Summary: ISTASTrack is a novel transformer-based hybrid tracker that integrates ANN and SNN branches with ISTA adapters for effective RGB-Event fusion, achieving state-of-the-art performance across multiple benchmarks while maintaining energy efficiency.

Authors:Zhitian Hou, Zihan Ye, Nanli Zeng, Tianyong Hao, Kun Zeng
Title: Large Language Models Meet Legal Artificial Intelligence: A Survey
Abstract:
Large Language Models (LLMs) have significantly advanced the development of Legal Artificial Intelligence (Legal AI) in recent years, enhancing the efficiency and accuracy of legal tasks. To advance research and applications of LLM-based approaches in legal domain, this paper provides a comprehensive review of 16 legal LLMs series and 47 LLM-based frameworks for legal tasks, and also gather 15 benchmarks and 29 datasets to evaluate different legal capabilities. Additionally, we analyse the challenges and discuss future directions for LLM-based approaches in the legal domain. We hope this paper provides a systematic introduction for beginners and encourages future research in this field. Resources are available at https://github.com/ZhitianHou/LLMs4LegalAI.
中文: 本文系统综述了16个法律大模型系列和47个基于大模型的法律任务框架,汇集了15个基准测试和29个数据集,通过分析挑战与未来方向推动法律人工智能发展,并为初学者提供研究资源。
English: This paper comprehensively reviews 16 legal LLM series and 47 LLM-based frameworks, along with 15 benchmarks and 29 datasets, to advance Legal AI by analyzing challenges and future directions while providing resources for beginners.

Authors:Anne Marthe Sophie Ngo Bibinbe, Chiron Bang, Patrick Gagnon, Jamie Ahloy-Dallaire, Eric R. Paquet
Title: An HMM-based framework for identity-aware long-term multi-object tracking from sparse and uncertain identification: use case on long-term tracking in livestock
Abstract:
The need for long-term multi-object tracking (MOT) is growing due to the demand for analyzing individual behaviors in videos that span several minutes. Unfortunately, due to identity switches between objects, the tracking performance of existing MOT approaches decreases over time, making them difficult to apply for long-term tracking. However, in many real-world applications, such as in the livestock sector, it is possible to obtain sporadic identifications for some of the animals from sources like feeders. To address the challenges of long-term MOT, we propose a new framework that combines both uncertain identities and tracking using a Hidden Markov Model (HMM) formulation. In addition to providing real-world identities to animals, our HMM framework improves the F1 score of ByteTrack, a leading MOT approach even with re-identification, on a 10 minute pig tracking dataset with 21 identifications at the pen's feeding station. We also show that our approach is robust to the uncertainty of identifications, with performance increasing as identities are provided more frequently. The improved performance of our HMM framework was also validated on the MOT17 and MOT20 benchmark datasets using both ByteTrack and FairMOT. The code for this new HMM framework and the new 10-minute pig tracking video dataset are available at: https://github.com/ngobibibnbe/uncertain-identity-aware-tracking
中文: 提出的隐马尔可夫模型框架通过整合零星的真实身份信息,有效解决了长期多目标跟踪中的身份切换问题,在牲畜追踪和标准数据集上均显著提升了跟踪性能。
English: The proposed Hidden Markov Model framework effectively addresses long-term multi-object tracking challenges by integrating sporadic real-world identifications, significantly improving tracking accuracy on both livestock and benchmark datasets.

Authors:Tim Broedermannn, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool
Title: DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
Abstract:
Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion
Chinese: 提出的DGFusion网络采用深度引导的多模态融合方法,通过深度感知特征和局部深度标记动态调整传感器融合,在复杂数据集上实现了最先进的全景和语义分割性能。
English: The proposed DGFusion network introduces a depth-guided multimodal fusion method that dynamically adapts sensor fusion using depth-aware features and local depth tokens, achieving state-of-the-art panoptic and semantic segmentation performance on challenging datasets.

Authors:Francisco M. López, Miles Lenz, Marco G. Fedozzi, Arthur Aubret, Jochen Triesch
Title: MIMo grows! Simulating body and sensory development in a multimodal infant model
Abstract:
Infancy is characterized by rapid body growth and an explosive change of sensory and motor abilities. However, developmental robots and simulation platforms are typically designed in the image of a specific age, which limits their ability to capture the changing abilities and constraints of developing infants. To address this issue, we present MIMo v2, a new version of the multimodal infant model. It includes a growing body with increasing actuation strength covering the age range from birth to 24 months. It also features foveated vision with developing visual acuity as well as sensorimotor delays modeling finite signal transmission speeds to and from an infant's brain. Further enhancements of this MIMo version include an inverse kinematics module, a random environment generator and updated compatiblity with third-party simulation and learning libraries. Overall, this new MIMo version permits increased realism when modeling various aspects of sensorimotor development. The code is available on the official repository (https://github.com/trieschlab/MIMo).
中文: 新版MIMo v2模型通过整合成长的身体、发展中的视觉敏锐度、感觉运动延迟及改进的模拟工具兼容性,增强了婴儿发育模拟的真实性,覆盖从出生到24个月的年龄段。
English: The new MIMo v2 model enhances realism in infant development simulations by incorporating a growing body, developing visual acuity, sensorimotor delays, and improved compatibility with simulation tools, covering ages from birth to 24 months.

Authors:Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra
Title: A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes
Abstract:
Computational models have emerged as powerful tools for energy modeling research, touting scalability and quantitative results. However, these models require a plethora of data, some of which is inaccessible, expensive, or raises privacy concerns. We introduce a modular multimodal framework to produce this data from publicly accessible residential information and images using generative artificial intelligence (AI). Additionally, we provide a pipeline demonstrating this framework, and we evaluate its generative AI components. Our experiments show that our framework's use of AI avoids common issues with generative models. Our framework produces realistic, labeled data. By reducing dependence on costly or restricted data sources, we pave a path towards more accessible and reproducible research.
中文: 本文提出了一种模块化多模态框架,利用生成式人工智能从公开的住宅信息和图像中生成真实、标注的数据,解决了计算能源建模中数据稀缺、成本高昂和隐私问题,同时提升了研究的可及性和可重复性。
English: This paper introduces a modular multimodal framework that uses generative AI to create realistic, labeled data from publicly accessible residential information and images, addressing the challenges of data scarcity, cost, and privacy in computational energy modeling while enhancing research accessibility and reproducibility.

Authors:Moslem Yazdanpanah, Ali Bahri, Mehrdad Noori, Sahar Dastani, Gustavo Adolfo Vargas Hakim, David Osowiechi, Ismail Ben Ayed, Christian Desrosiers
Title: Purge-Gate: Backpropagation-Free Test-Time Adaptation for Point Clouds Classification via Token Purging
Abstract:
Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3\% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4 times faster and 5.5 times more memory efficient than our baseline, making it suitable for real-world deployment. Code is available at \hyperlink{https://github.com/MosyMosy/Purge-Gate}{https://github.com/MosyMosy/Purge-Gate}
Chinese: 本文提出Token Purging(PG)方法,这是一种无需反向传播的测试时自适应技术,通过过滤受域偏移影响的特征标记来提升3D点云分类性能,在精度和效率上均超越现有方法。
English: This paper introduces Token Purging (PG), a backpropagation-free test-time adaptation method for 3D point cloud classification that removes domain-shifted tokens before attention layers, achieving superior accuracy and efficiency over existing approaches.

Authors:Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu
Title: LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
Abstract:
KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types. Our code is available at https://github.com/MGDDestiny/Lava.
中文:LAVa提出了一个统一的KV缓存压缩框架,通过最小化Transformer残差流中的信息损失,实现了无需训练或组合多种策略的动态层级和注意力头预算分配,并在多个基准测试中展现出卓越性能。
English: LAVa introduces a unified KV cache compression framework that minimizes information loss in Transformer residual streams, enabling dynamic layer and head budget allocation without requiring training or multiple strategies, and achieves superior performance across various benchmarks.

Authors:Leen Daher, Zhaobo Wang, Malcolm Mielle
Title: D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference
Abstract:
Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically and technically usable. To address this, we propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modality during inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces the alignment of sensors' feature spaces without requiring the coupling of the classification pipelines of both modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to 10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model isn't overfitted on the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at https://github.com/Schindler-EPFL-Lab/D-CAT.
中文: 提出的D-CAT框架无需推理时配对传感器数据即可实现跨模态知识迁移,在提升分类性能的同时降低了资源受限环境下的硬件依赖。
English: The proposed D-CAT framework enables cross-modal knowledge transfer without requiring paired sensor data during inference, improving classification performance while reducing hardware dependency in resource-constrained environments.

Authors:Mujie Liu, Chenze Wang, Liping Chen, Nguyen Linh Dan Le, Niharika Tewari, Ting Dang, Jiangang Ma, Feng Xia
Title: Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis
Abstract:
The limited availability of labeled brain network data makes it challenging to achieve accurate and interpretable psychiatric diagnoses. While self-supervised learning (SSL) offers a promising solution, existing methods often rely on augmentation strategies that can disrupt crucial structural semantics in brain graphs. To address this, we propose SAM-BG, a two-stage framework for learning brain graph representations with structural semantic preservation. In the pre-training stage, an edge masker is trained on a small labeled subset to capture key structural semantics. In the SSL stage, the extracted structural priors guide a structure-aware augmentation process, enabling the model to learn more semantically meaningful and robust representations. Experiments on two real-world psychiatric datasets demonstrate that SAM-BG outperforms state-of-the-art methods, particularly in small-labeled data settings, and uncovers clinically relevant connectivity patterns that enhance interpretability. Our code is available at https://github.com/mjliu99/SAM-BG.
中文:提出的SAM-BG框架通过结构语义保持技术改进脑网络表征学习,在标注数据有限的精神疾病分析中实现了更优的诊断准确性和可解释性。
English: The proposed SAM-BG framework uses structural semantic preservation to enhance brain graph representation learning, achieving superior diagnostic accuracy and interpretability in psychiatric analysis with limited labeled data.

Authors:Chunyu Li, Xindi Zheng, Siqi Liu
Title: BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025
Abstract:
Entity linking (EL) for biomedical text is typically benchmarked on English-only corpora with flat mentions, leaving the more realistic scenario of nested and multilingual mentions largely unexplored. We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English & Russian), closing this gap with a lightweight pipeline that keeps the original EL model intact and modifies only three task-aligned components: Two-stage retrieval-ranking. We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning. Boundary cues. In the ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing the encoder with an explicit, language-agnostic span before robustness to overlap and nesting. Dataset augmentation. We also automatically expand the ranking training corpus with three complementary data sources, enhancing coverage without extra manual annotation. On the BioNNE 2025 leaderboard, our two stage system, bilingual bert (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications. Code are publicly available at https://github.com/Kaggle-Competitions-Code/BioNNE-L.
中文摘要:本研究提出了一种轻量级多语言生物医学嵌套实体链接系统,通过双阶段检索排序、边界标记和数据集增强三项核心改进,在保持原模型不变的情况下获得BioNNE 2025竞赛第三名。
English Summary: The study introduces a lightweight pipeline for multilingual biomedical nested entity linking, achieving third place in the BioNNE 2025 challenge through two-stage retrieval-ranking, boundary cues, and dataset augmentation while keeping the core model unchanged.

Authors:Zhenhua Xu, Xixiang Zhao, Xubin Yue, Shengwei Tian, Changting Lin, Meng Han
Title: CTCC: A Robust and Stealthy Fingerprinting Framework for Large Language Models via Cross-Turn Contextual Correlation Backdoor
Abstract:
The widespread deployment of large language models (LLMs) has intensified concerns around intellectual property (IP) protection, as model theft and unauthorized redistribution become increasingly feasible. To address this, model fingerprinting aims to embed verifiable ownership traces into LLMs. However, existing methods face inherent trade-offs between stealthness, robustness, and generalizability, being either detectable via distributional shifts, vulnerable to adversarial modifications, or easily invalidated once the fingerprint is revealed. In this work, we introduce CTCC, a novel rule-driven fingerprinting framework that encodes contextual correlations across multiple dialogue turns, such as counterfactual, rather than relying on token-level or single-turn triggers. CTCC enables fingerprint verification under black-box access while mitigating false positives and fingerprint leakage, supporting continuous construction under a shared semantic rule even if partial triggers are exposed. Extensive experiments across multiple LLM architectures demonstrate that CTCC consistently achieves stronger stealth and robustness than prior work. Our findings position CTCC as a reliable and practical solution for ownership verification in real-world LLM deployment scenarios. Our code and data are publicly available at .
中文: 本文提出CTCC这一新型规则驱动指纹框架,通过在多轮对话中编码上下文关联来嵌入大语言模型的所有权标识,相比现有方法在隐蔽性和鲁棒性方面表现更优,为实际部署中的知识产权保护提供了可靠解决方案。
English: This paper introduces CTCC, a novel rule-driven fingerprinting framework that embeds ownership traces in large language models by encoding contextual correlations across dialogue turns, achieving superior stealth and robustness compared to existing methods for reliable intellectual property protection.

Authors:Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
Title: ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
Abstract:
Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $μ= 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. For LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 37.3 for QuIP. \href{https://github.com/42Shawn/Butterflyquant-llm}{Codes} are available.
中文: ButterflyQuant采用可学习的蝴蝶变换,通过连续参数自适应抑制激活值异常值,在2位量化中相比先前方法显著降低困惑度,且计算开销极小。
English: ButterflyQuant introduces learnable butterfly transforms with continuous parameters to adaptively suppress activation outliers for improved 2-bit quantization, achieving significantly lower perplexity than previous methods with minimal computational overhead.

Authors:Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding
Title: SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Abstract:
Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $π_0$ on RoboTwin 1.0\&2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon ``pushcut'' during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL
中文:SimpleVLA-RL是一种高效的强化学习框架,通过增强视觉-语言-动作模型的长期规划能力,在减少对昂贵人工数据依赖的同时实现了最先进的性能表现和更强的泛化能力。
English: SimpleVLA-RL is an efficient reinforcement learning framework that enhances Vision-Language-Action models' long-horizon planning, achieving state-of-the-art performance while reducing reliance on costly human-operated data and improving generalization.

Authors:Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou
Title: Measuring Epistemic Humility in Multimodal Large Language Models
Abstract:
Hallucinations in multimodal large language models (MLLMs) -- where the model generates content inconsistent with the input image -- pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs' ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a "None of the above" option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs -- including both general-purpose and specialized reasoning models -- on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
Chinese: HumbleBench是一个新的基准测试,通过引入“以上都不是”选项来评估多模态大语言模型拒绝看似合理但错误答案的能力,填补了在安全关键应用中评估认知谦逊和可靠性的重要空白。
English: HumbleBench is a new benchmark designed to assess multimodal large language models' ability to reject plausible but incorrect answers by incorporating a "None of the above" option, addressing the critical gap in evaluating epistemic humility and reliability in safety-critical applications.

Authors:Zakaria El Kassimi, Fares Fourati, Mohamed-Slim Alouini
Title: Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations
Abstract:
We study question answering in the domain of radio regulations, a legally sensitive and high-stakes area. We propose a telecom-specific Retrieval-Augmented Generation (RAG) pipeline and introduce, to our knowledge, the first multiple-choice evaluation set for this domain, constructed from authoritative sources using automated filtering and human validation. To assess retrieval quality, we define a domain-specific retrieval metric, under which our retriever achieves approximately 97% accuracy. Beyond retrieval, our approach consistently improves generation accuracy across all tested models. In particular, while naively inserting documents without structured retrieval yields only marginal gains for GPT-4o (less than 1%), applying our pipeline results in nearly a 12% relative improvement. These findings demonstrate that carefully targeted grounding provides a simple yet strong baseline and an effective domain-specific solution for regulatory question answering. All code and evaluation scripts, along with our derived question-answer dataset, are available at https://github.com/Zakaria010/Radio-RAG.
中文摘要:本研究针对无线电监管领域开发了专用的RAG解决方案,通过领域特定的信息检索实现了97%的检索准确率,并使GPT-4o的生成准确率提升近12%。
English Summary: This research develops a telecom-specific RAG pipeline for radio regulation question answering, achieving 97% retrieval accuracy and nearly 12% generation improvement for GPT-4o through domain-specific grounding.

Authors:Jielin Qiu, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Jianguo Zhang, Haolin Chen, Shiyu Wang, Ming Zhu, Liangwei Yang, Juntao Tan, Zhepeng Cen, Cheng Qian, Shelby Heinecke, Weiran Yao, Silvio Savarese, Caiming Xiong, Huan Wang
Title: LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
Abstract:
The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined in a LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development represents a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
中文: LoCoBench是一个专为评估长上下文语言模型在复杂软件开发场景中表现而设计的综合基准,涵盖10种编程语言的8000个测试场景,揭示了当前模型在长代码理解方面存在显著不足。
English: LoCoBench is a comprehensive benchmark designed to evaluate long-context language models in complex software development scenarios, featuring 8,000 scenarios across 10 programming languages and revealing significant performance gaps in current models.

Authors:Sijun Dong, Yuxuan Hu, LiBo Wang, Geng Chen, Xiaoliang Meng
Title: PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection
Abstract:
To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
中文: PeftCD是一种基于视觉基础模型的高效参数微调变化检测框架,能有效应对伪变化和跨域泛化挑战,在多个数据集上以少量可训练参数实现了最优性能。
English: PeftCD is a parameter-efficient change detection framework leveraging Vision Foundation Models to address pseudo changes and cross-domain generalization, achieving state-of-the-art accuracy across multiple datasets with minimal trainable parameters.

Authors:Akshit Achara, Esther Puyol Anton, Alexander Hammers, Andrew P. King
Title: Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer's Disease Classification
Abstract:
Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer's disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR
中文摘要:本研究揭示了基于磁共振成像的阿尔茨海默病深度学习诊断模型存在与种族和性别相关的捷径学习及人口统计学偏差,可能影响不同人群诊断的公平性。
English Summary: This study demonstrates that deep learning models for Alzheimer's disease diagnosis from MRI scans exhibit shortcut learning and demographic bias related to race and sex, potentially compromising diagnostic fairness across different population groups.

Authors:Sirui Xu, Dongting Li, Yucheng Zhang, Xiyan Xu, Qi Long, Ziyin Wang, Yunzhi Lu, Shuchang Dong, Hezi Jiang, Akshat Gupta, Yu-Xiong Wang, Liang-Yan Gui
Title: InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation
Abstract:
While large-scale human motion capture datasets have advanced human motion generation, modeling and generating dynamic 3D human-object interactions (HOIs) remain challenging due to dataset limitations. Existing datasets often lack extensive, high-quality motion and annotation and exhibit artifacts such as contact penetration, floating, and incorrect hand motions. To address these issues, we introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements. First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations. Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions. Leveraging the principle of contact invariance, we maintain human-object relationships while introducing motion variations, expanding the dataset to 30.70 hours. Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance. Extensive experiments validate the utility of our dataset as a foundational resource for advancing 3D human-object interaction generation. To support continued research in this area, the dataset is publicly available at https://github.com/wzyabcas/InterAct, and will be actively maintained.
中文:InterAct基准通过整合优化21.81小时数据至30.70小时高质量交互,配以详细文本标注,解决了现有三维人物交互数据集的缺陷,同时建立统一生成建模任务并实现最优性能。
English: The InterAct benchmark addresses limitations in existing 3D human-object interaction datasets by consolidating and optimizing 21.81 hours of data into 30.70 hours of high-quality interactions with detailed annotations, while establishing unified generative modeling tasks that achieve state-of-the-art performance.

Authors:Jian Zhu, Xin Zou, Xi Wang, Ning Zhang, Bian Wu, Yao Yang, Ying Zhou, Lingfang Zeng, Chang Tang, Cheng Luo
Title: Generative Diffusion Contrastive Network for Multi-View Clustering
Abstract:
In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to clustering performance. However, there is a problem of low-quality data in multi-view fusion. This problem primarily arises from two reasons: 1) Certain views are contaminated by noisy data. 2) Some views suffer from missing data. This paper proposes a novel Stochastic Generative Diffusion Fusion (SGDF) method to address this problem. SGDF leverages a multiple generative mechanism for the multi-view feature of each sample. It is robust to low-quality data. Building on SGDF, we further present the Generative Diffusion Contrastive Network (GDCN). Extensive experiments show that GDCN achieves the state-of-the-art results in deep MVC tasks. The source code is publicly available at https://github.com/HackerHyper/GDCN.
中文: 本文提出了一种新颖的随机生成扩散融合方法和生成扩散对比网络,通过鲁棒的多视图融合有效解决了多视图聚类中的低质量数据问题,实现了最先进的性能。
English: This paper introduces a novel Stochastic Generative Diffusion Fusion (SGDF) method and the Generative Diffusion Contrastive Network (GDCN) to address low-quality data issues in Multi-View Clustering, achieving state-of-the-art performance through robust multi-view fusion.

Authors:Cynthia Moreira Maia, Lucas B. V. de Amorim, George D. C. Cavalcanti, Rafael M. O. Cruz
Title: PIPES: A Meta-dataset of Machine Learning Pipelines
Abstract:
Solutions to the Algorithm Selection Problem (ASP) in machine learning face the challenge of high computational costs associated with evaluating various algorithms' performances on a given dataset. To mitigate this cost, the meta-learning field can leverage previously executed experiments shared in online repositories such as OpenML. OpenML provides an extensive collection of machine learning experiments. However, an analysis of OpenML's records reveals limitations. It lacks diversity in pipelines, specifically when exploring data preprocessing steps/blocks, such as scaling or imputation, resulting in limited representation. Its experiments are often focused on a few popular techniques within each pipeline block, leading to an imbalanced sample. To overcome the observed limitations of OpenML, we propose PIPES, a collection of experiments involving multiple pipelines designed to represent all combinations of the selected sets of techniques, aiming at diversity and completeness. PIPES stores the results of experiments performed applying 9,408 pipelines to 300 datasets. It includes detailed information on the pipeline blocks, training and testing times, predictions, performances, and the eventual error messages. This comprehensive collection of results allows researchers to perform analyses across diverse and representative pipelines and datasets. PIPES also offers potential for expansion, as additional data and experiments can be incorporated to support the meta-learning community further. The data, code, supplementary material, and all experiments can be found at https://github.com/cynthiamaia/PIPES.git.
Chinese Summary: 针对OpenML在算法选择中存在的流程多样性不足和技术代表性失衡问题,PIPES提出了包含9,408种多样化流程的实验集合,通过在300个数据集上的测试结果,为元学习研究提供了全面可靠的分析基础。
English Summary: To address the limitations of OpenML's limited pipeline diversity and imbalanced technique representation in algorithm selection, PIPES introduces a comprehensive collection of 9,408 diverse pipelines tested on 300 datasets, providing detailed experimental results for robust meta-learning analysis.

Authors:Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang
Title: Semantic Concentration for Self-Supervised Dense Representations Learning
Abstract:
Recent advances in image-level self-supervised learning (SSL) have made significant progress, yet learning dense representations for patches remains challenging. Mainstream methods encounter an over-dispersion phenomenon that patches from the same instance/category scatter, harming downstream performance on dense tasks. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration. Specifically, the non-strict spatial alignment ensures intra-instance consistency, while shared patterns, i.e., similar parts of within-class instances in the input space, ensure inter-image consistency. Unfortunately, these approaches are infeasible for dense SSL due to their spatial sensitivity and complicated scene-centric data. These observations motivate us to explore explicit semantic concentration for dense SSL. First, to break the strict spatial alignment, we propose to distill the patch correspondences. Facing noisy and imbalanced pseudo labels, we propose a noise-tolerant ranking loss. The core idea is extending the Average Precision (AP) loss to continuous targets, such that its decision-agnostic and adaptive focusing properties prevent the student model from being misled. Second, to discriminate the shared patterns from complicated scenes, we propose the object-aware filter to map the output space to an object-based space. Specifically, patches are represented by learnable prototypes of objects via cross-attention. Last but not least, empirical studies across various tasks soundly support the effectiveness of our method. Code is available in https://github.com/KID-7391/CoTAP.
中文摘要:本研究针对密集自监督学习中的过度分散问题,提出通过带噪声容忍排序损失的块对应蒸馏和对象感知过滤来实现显式语义集中,有效提升了多任务下的表示学习性能。
English Summary: This study addresses the challenge of over-dispersion in dense self-supervised learning by proposing explicit semantic concentration through patch correspondence distillation with noise-tolerant ranking loss and object-aware filtering to enhance representation learning across various tasks.

Authors:Yuchan Jie, Yushen Xu, Xiaosong Li, Fuqiang Zhou, Jianming Lv, Huafeng Li
Title: FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution
Abstract:
As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network istrained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.
Chinese: FS-Diff是一种创新的联合图像融合与超分辨率方法,通过语义引导和清晰度感知机制生成具有丰富细节和语义信息的高分辨率融合图像,在多个数据集上均优于现有技术。
English: FS-Diff is a novel joint image fusion and super-resolution method that leverages semantic guidance and clarity-aware mechanisms to generate high-resolution fused images with enhanced details and semantic information, outperforming existing techniques across multiple datasets.

Authors:Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra
Title: Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift
Abstract:
Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.
中文摘要:DRiFt框架通过解耦临床相关特征与任务无关噪声,提升了医学视觉语言模型的分布内性能与跨数据集鲁棒性,为临床安全应用提供了更可靠的解决方案。
English Summary: The DRiFt framework enhances medical vision-language models by decoupling clinical features from task-agnostic noise, improving both in-distribution performance and robustness across datasets for safer clinical deployment.

Authors:Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi
Title: LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
Abstract:
To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
Chinese: 语言模型难以生成有效的自我反事实解释,它们要么做出过多修改而缺乏简洁性,要么改动过小无法改变预测结果,这降低了其在关键决策中作为解释工具的可靠性。
English: Language models struggle to produce effective self-generated counterfactual explanations, as they either make excessive changes that remain valid but not minimal, or overly subtle edits that fail to alter predictions, limiting their reliability for explaining decisions in high-stakes applications.

Authors:Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang
Title: VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
Abstract:
Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fast inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.
中文: VLA-Adapter通过轻量级策略模块有效连接视觉语言表征与动作空间,无需大规模预训练或机器人数据即可实现顶尖性能,并能在消费级硬件上快速完成训练。
English: VLA-Adapter introduces a lightweight module that efficiently bridges vision-language representations to action spaces, achieving state-of-the-art performance without large-scale pre-training or robotic data, while enabling rapid training on consumer hardware.

Authors:Dimitrios Anastasiou, Razvan Caramalau, Nazir Sirajudeen, Matthew Boal, Philip Edwards, Justin Collins, John Kelly, Ashwin Sridhar, Maxine Tran, Faiz Mumtaz, Nevil Pavithran, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos
Title: Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment
Abstract:
Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.
中文: 本研究探讨了自监督预训练对小样本手术技能评估的影响,结果表明领域相关数据集优于规模更大但相关性较低的来源,且加入手术特定数据可显著提升性能。
English: This study explores how self-supervised pre-training impacts few-shot surgical skill assessment, demonstrating that domain-relevant datasets outperform larger but less aligned sources and that incorporating procedure-specific data enhances performance.

Authors:Hui Li, Yi You, Qiqi Chen, Bingfeng Zhang, George Q. Huang
Title: Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM
Abstract:
Generative AI evolves the execution of complex workflows in industry, where the large multimodal model empowers fashion design in the garment industry. Current generation AI models magically transform brainstorming into fancy designs easily, but the fine-grained customization still suffers from text uncertainty without professional background knowledge from end-users. Thus, we propose the Better Understanding Generation (BUG) workflow with LMM to automatically create and fine-grain customize the cloth designs from chat with image-into-prompt. Our framework unleashes users' creative potential beyond words and also lowers the barriers of clothing design/editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow, evaluated from generation similarity, user satisfaction, and quality. The code and dataset: https://github.com/detectiveli/FashionEdit.
中文: 生成式AI革新了服装行业复杂工作流程,将创意轻松转化为设计,但精细定制仍受文本模糊性限制,因此我们提出BUG工作流,通过图像转提示自动生成和微调服装设计,降低设计门槛并释放用户创造力。
English: Generative AI enhances complex workflows in the garment industry by enabling easy transformation of ideas into designs, yet struggles with fine-grained customization due to text ambiguity, leading to the proposed BUG workflow for automated and precise clothing design from conversational inputs.

Authors:Weixing Wei, Kazuyoshi Yoshii
Title: Efficient Transformer-Based Piano Transcription With Sparse Attention Mechanisms
Abstract:
This paper investigates automatic piano transcription based on computationally-efficient yet high-performant variants of the Transformer that can capture longer-term dependency over the whole musical piece. Recently, transformer-based sequence-to-sequence models have demonstrated excellent performance in piano transcription. These models, however, fail to deal with the whole piece at once due to the quadratic complexity of the self-attention mechanism, and music signals are thus typically processed in a sliding-window manner in practice. To overcome this limitation, we propose an efficient architecture with sparse attention mechanisms. Specifically, we introduce sliding-window self-attention mechanisms for both the encoder and decoder, and a hybrid global-local cross-attention mechanism that attends to various spans according to the MIDI token types. We also use a hierarchical pooling strategy between the encoder and decoder to further reduce computational load. Our experiments on the MAESTRO dataset showed that the proposed model achieved a significant reduction in computational cost and memory usage, accelerating inference speed, while maintaining transcription performance comparable to the full-attention baseline. This allows for training with longer audio contexts on the same hardware, demonstrating the viability of sparse attention for building efficient and high-performance piano transcription systems. The code is available at https://github.com/WX-Wei/efficient-seq2seq-piano-trans.
中文: 本文提出了一种基于Transformer的高效自动钢琴转录模型,通过稀疏注意力机制显著降低了计算成本和内存使用,在保持与全注意力模型相当性能的同时,实现了对更长音乐片段的处理能力。
English: This paper introduces an efficient Transformer-based model for automatic piano transcription that uses sparse attention mechanisms to reduce computational costs while maintaining performance comparable to full-attention models, enabling processing of longer musical pieces on the same hardware.

Authors:Illia Volkov, Nikita Kisel, Klara Janouskova, Jiri Matas
Title: Image Recognition with Vision and Language Embeddings of VLMs
Abstract:
Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
Chinese: 视觉语言模型在零样本图像分类中展现出互补优势,语言引导和纯视觉方法在不同类别上各有所长,据此提出了一种无需学习的融合策略,有效提升了分类性能。
English: Vision-language models demonstrate complementary strengths in zero-shot image classification, with language-guided and vision-only approaches each excelling in different categories, leading to a proposed fusion method that enhances performance without additional training.

Authors:Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang
Title: Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization
Abstract:
Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.
中文摘要:MatCha作为首个材料表征图像理解的基准,揭示了当前多模态大语言模型在需要高级领域知识和视觉分析的复杂任务中,其表现远逊于人类专家。
English Summary: MatCha is introduced as the first benchmark for materials characterization image understanding, revealing that current multimodal large language models significantly underperform human experts in tasks requiring advanced domain knowledge and visual analysis.

Authors:Weige Cai, Tong Zhu, Jinyi Niu, Ruiqi Hu, Lingyao Li, Tenglong Wang, Xiaowu Dai, Weining Shen, Liwen Zhang
Title: LightAgent: Production-level Open-source Agentic AI Framework
Abstract:
With the rapid advancement of large language models (LLMs), Multi-agent Systems (MAS) have achieved significant progress in various application scenarios. However, substantial challenges remain in designing versatile, robust, and efficient platforms for agent deployment. To address these limitations, we propose \textbf{LightAgent}, a lightweight yet powerful agentic framework, effectively resolving the trade-off between flexibility and simplicity found in existing frameworks. LightAgent integrates core functionalities such as Memory (mem0), Tools, and Tree of Thought (ToT), while maintaining an extremely lightweight structure. As a fully open-source solution, it seamlessly integrates with mainstream chat platforms, enabling developers to easily build self-learning agents. We have released LightAgent at \href{https://github.com/wxai-space/LightAgent}{https://github.com/wxai-space/LightAgent}
中文摘要:LightAgent作为一个轻量级开源框架,通过集成记忆、工具和思维树等核心功能,解决了多智能体系统在灵活性与简洁性之间的权衡问题,使开发者能够轻松构建自学习智能体。
English Summary: LightAgent is a lightweight, open-source framework that overcomes the flexibility-simplicity trade-off in multi-agent systems by integrating memory, tools, and Tree of Thought functionalities for easy development of self-learning agents.

Authors:Anthony P. Addison, Felix Wagner, Wentian Xu, Natalie Voets, Konstantinos Kamnitsas
Title: Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training
Abstract:
Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR). The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG
中文: 本研究提出了一种改进的U-net架构,通过引入模态无关通道和图像增强策略生成人工MRI对比度,能够在保持解剖真实性的同时,有效分割训练中见过和未见过的脑部病变成像模态。
English: This study introduces a modified U-net architecture with a modality-agnostic pathway and an image augmentation strategy to create artificial MRI contrasts, enabling effective segmentation of brain lesions across both seen and unseen imaging modalities while maintaining anatomical realism.

Authors:Jing Hao, Yuxuan Fan, Yanpeng Sun, Kaixin Guo, Lizhuo Lin, Jinrong Yang, Qi Yong H. Ai, Lun M. Wong, Hao Tang, Kuo Feng Hung
Title: Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis
Abstract:
Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.
中文摘要:大型视觉语言模型在通用医疗任务中表现优异,但在牙科全景X光片解读方面存在局限,为此开发了MMOral数据集和OralGPT模型,通过微调显著提升了诊断性能。
English Summary: Large vision-language models show promise for medical tasks but struggle with dental panoramic X-rays, leading to the creation of MMOral dataset and OralGPT model, which significantly improves diagnostic accuracy through fine-tuning.

Authors:Jiesi Hu, Jianfeng Cao, Yanwu Yang, Chenfei Ye, Yixuan Zhang, Hanyang Peng, Ting Ma
Title: Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement
Abstract:
In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present \textbf{Medverse}, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model are publicly available at https://github.com/jiesihu/Medverse.
中文: Medverse提出了一种通用的3D医学影像上下文学习模型,通过渐进式细化预测和多尺度解剖感知,在多样化任务和解剖区域中实现高保真预测和全局解剖理解,显著优于现有基准模型。
English: Medverse introduces a universal in-context learning model for 3D medical imaging that achieves high-fidelity predictions and global anatomical awareness across diverse tasks and anatomical regions, significantly outperforming existing baselines.

Authors:Chin Yuen Kwok, Jia Qi Yip, Zhen Qiu, Chi Hung Chi, Kwok Yan Lam
Title: Bona fide Cross Testing Reveals Weak Spot in Audio Deepfake Detection Systems
Abstract:
Audio deepfake detection (ADD) models are commonly evaluated using datasets that combine multiple synthesizers, with performance reported as a single Equal Error Rate (EER). However, this approach disproportionately weights synthesizers with more samples, underrepresenting others and reducing the overall reliability of EER. Additionally, most ADD datasets lack diversity in bona fide speech, often featuring a single environment and speech style (e.g., clean read speech), limiting their ability to simulate real-world conditions. To address these challenges, we propose bona fide cross-testing, a novel evaluation framework that incorporates diverse bona fide datasets and aggregates EERs for more balanced assessments. Our approach improves robustness and interpretability compared to traditional evaluation methods. We benchmark over 150 synthesizers across nine bona fide speech types and release a new dataset to facilitate further research at https://github.com/cyaaronk/audio_deepfake_eval.
Chinese Summary: 当前音频深度伪造检测模型的评估因合成器样本不平衡和真实语音多样性不足而存在缺陷,为此我们提出了一种新颖的真实语音交叉测试框架,通过整合多样化数据集和聚合等错误率来提升鲁棒性和可解释性。
English Summary: The current evaluation of audio deepfake detection models is flawed due to imbalanced synthesizer representation and limited bona fide speech diversity, prompting the introduction of a novel bona fide cross-testing framework that enhances robustness and interpretability through diverse datasets and aggregated EERs.

Authors:Yuhao Zhang, Yuhao Du, Zhanchen Dai, Xiangnan Ma, Kaiqi Kou, Benyou Wang, Haizhou Li
Title: EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs
Abstract:
Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.
Chinese: EchoX作为一种新型语音大语言模型,通过融合语义学习和动态生成语音目标来克服声学语义鸿沟,仅用六千小时训练数据就在多个知识问答基准上实现了领先性能。
English: EchoX is a novel speech-to-speech large language model that overcomes the acoustic-semantic gap by integrating semantic learning with dynamically generated speech targets, achieving advanced performance on knowledge-based benchmarks with only six thousand hours of training data.

Authors:Yuiko Uchida, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation
Abstract:
This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on "objects," which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the "objectness" of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.
中文摘要:本文提出OSIM这一面向3D场景的物体中心化评估新指标,通过物体检测模型量化场景中各物体的“物体性”,用户研究表明其比现有指标更符合人类感知,并重新评估了当前主流3D重建与生成模型。
English Summary: This paper introduces OSIM, an object-centric evaluation metric for 3D scenes that aligns more closely with human perception by quantifying objectness through object detection models, as validated by user studies and comparative analyses.

Authors:Liqun He, Jiaqi Xu
Title: Automated Classification of Tutors' Dialogue Acts Using Generative AI: A Case Study Using the CIMA Corpus
Abstract:
This study explores the use of generative AI for automating the classification of tutors' Dialogue Acts (DAs), aiming to reduce the time and effort required by traditional manual coding. This case study uses the open-source CIMA corpus, in which tutors' responses are pre-annotated into four DA categories. Both GPT-3.5-turbo and GPT-4 models were tested using tailored prompts. Results show that GPT-4 achieved 80% accuracy, a weighted F1-score of 0.81, and a Cohen's Kappa of 0.74, surpassing baseline performance and indicating substantial agreement with human annotations. These findings suggest that generative AI has strong potential to provide an efficient and accessible approach to DA classification, with meaningful implications for educational dialogue analysis. The study also highlights the importance of task-specific label definitions and contextual information in enhancing the quality of automated annotation. Finally, it underscores the ethical considerations associated with the use of generative AI and the need for responsible and transparent research practices. The script of this research is publicly available at https://github.com/liqunhe27/Generative-AI-for-educational-dialogue-act-tagging.
本研究证明生成式AI(尤其是GPT-4)能有效自动分类导师对话行为,其高准确率与人工标注高度一致,为教育对话分析提供了高效的手动编码替代方案。
This study demonstrates that generative AI, particularly GPT-4, can effectively automate the classification of tutors' dialogue acts with high accuracy and substantial agreement with human annotations, offering an efficient alternative to manual coding.

Authors:Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura
Title: Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention
Abstract:
Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants' structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.
Chinese: ZeroPlantSeg是一种零样本分割方法,通过结合基础分割模型和视觉语言模型,无需训练即可从顶视图中有效分割出完整的莲座状植物个体,其性能优于现有方法。
English: ZeroPlantSeg is a zero-shot method that combines a foundation segmentation model with a vision-language model to effectively segment entire rosette-shaped plant individuals from top-view images, outperforming existing approaches without requiring training data.

Authors:Kelin Ren, Chan-Yang Ju, Dong-Ho Lee
Title: Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation
Abstract:
Multimodal recommendation systems are increasingly becoming foundational technologies for e-commerce and content platforms, enabling personalized services by jointly modeling users' historical behaviors and the multimodal features of items (e.g., visual and textual). However, most existing methods rely on either static fusion strategies or graph-based local interaction modeling, facing two critical limitations: (1) insufficient ability to model fine-grained cross-modal associations, leading to suboptimal fusion quality; and (2) a lack of global distribution-level consistency, causing representational bias. To address these, we propose MambaRec, a novel framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, we introduce the Dilated Refinement Attention Module (DREAM), which uses multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between visual and textual modalities. This module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive loss functions to constrain global modality alignment, enhancing semantic consistency. This dual regularization reduces mode-specific deviations and boosts robustness. To improve scalability, MambaRec employs a dimensionality reduction strategy to lower the computational cost of high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency. Our code has been made publicly available at https://github.com/rkl71/MambaRec.
中文摘要:MambaRec是一种新颖的多模态推荐框架,通过扩张细化注意力模块实现细粒度语义对齐,并结合全局分布正则化增强跨模态融合,在电商数据集上展现出更优的融合质量、泛化能力和计算效率。
English Summary: MambaRec is a novel multimodal recommendation framework that enhances cross-modal fusion through fine-grained semantic alignment using a Dilated Refinement Attention Module and global distribution regularization, demonstrating superior performance in fusion quality, generalization, and efficiency on e-commerce datasets.

Authors:Jianqin Gao, Tianqi Wang, Yu Zhang, Yishu Zhang, Chenyuan Wang, Allan Dong, Zihao Wang
Title: FPI-Det: a face--phone Interaction Dataset for phone-use detection and understanding
Abstract:
The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human--device interactions. To address this gap, we introduce the FPI-Det, containing 22{,}879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset is available at https://github.com/KvCgRv/FPI-Det.
中文摘要:FPI-Det数据集通过在多场景下提供人脸与手机的同步标注,解决了现有基准在细粒度人机交互检测方面的不足,为手机使用行为识别建立了新标准。
English Summary: The FPI-Det dataset addresses limitations in current benchmarks by providing detailed annotations for detecting phone usage through human-device interactions across various challenging scenarios.

Authors:Jifeng Shen, Haibo Zhan, Xin Zuo, Heng Fan, Xiaohui Yuan, Jun Li, Wankou Yang
Title: IRDFusion: Iterative Relation-Map Difference guided Feature Fusion for Multispectral Object Detection
Abstract:
Current multispectral object detection methods often retain extraneous background or noise during feature fusion, limiting perceptual performance. To address this, we propose an innovative feature fusion framework based on cross-modal feature contrastive and screening strategy, diverging from conventional approaches. The proposed method adaptively enhances salient structures by fusing object-aware complementary cross-modal features while suppressing shared background interference. Our solution centers on two novel, specially designed modules: the Mutual Feature Refinement Module (MFRM) and the Differential Feature Feedback Module (DFFM). The MFRM enhances intra- and inter-modal feature representations by modeling their relationships, thereby improving cross-modal alignment and discriminative power. Inspired by feedback differential amplifiers, the DFFM dynamically computes inter-modal differential features as guidance signals and feeds them back to the MFRM, enabling adaptive fusion of complementary information while suppressing common-mode noise across modalities. To enable robust feature learning, the MFRM and DFFM are integrated into a unified framework, which is formally formulated as an Iterative Relation-Map Differential Guided Feature Fusion mechanism, termed IRDFusion. IRDFusion enables high-quality cross-modal fusion by progressively amplifying salient relational signals through iterative feedback, while suppressing feature noise, leading to significant performance gains. In extensive experiments on FLIR, LLVIP and M$^3$FD datasets, IRDFusion achieves state-of-the-art performance and consistently outperforms existing methods across diverse challenging scenarios, demonstrating its robustness and effectiveness. Code will be available at https://github.com/61s61min/IRDFusion.git.
中文摘要:本文提出IRDFusion框架,通过跨模态特征对比与筛选策略实现自适应特征融合,在抑制背景干扰的同时增强显著目标特征,在多个数据集上达到最优性能。
English Summary: This paper introduces IRDFusion, a novel feature fusion framework that enhances salient object structures through iterative cross-modal feature refinement while suppressing background noise, achieving state-of-the-art performance on multiple datasets.

Authors:Ahmed Adnan, Mushfiqur Rahman, Saad Sakib Noor, Kazi Sakib
Title: CLARA: A Developer's Companion for Code Comprehension and Analysis
Abstract:
Code comprehension and analysis of open-source project codebases is a task frequently performed by developers and researchers. However, existing tools that practitioners use for assistance with such tasks often require prior project setup, lack context-awareness, and involve significant manual effort. To address this, we present CLARA, a browser extension that utilizes a state-of-the-art inference model to assist developers and researchers in: (i) comprehending code files and code fragments, (ii) code refactoring, and (iii) code quality attribute detection. We qualitatively evaluated CLARA's inference model using existing datasets and methodology, and performed a comprehensive user study with 10 developers and academic researchers to assess its usability and usefulness. The results show that CLARA is useful, accurate, and practical in code comprehension and analysis tasks. CLARA is an open-source tool available at https://github.com/SaadNoor555/CLARA_tool_demo. A video showing the full capabilities of CLARA can be found at https://youtu.be/VDKVXvIH41Q?si=qBFsmS_Y4m_9x3YH.
中文:CLARA是一款浏览器扩展,采用先进推理模型帮助开发者进行代码理解、重构和质量检测,经评估和用户研究证明其有效且实用。
English: CLARA is a browser extension that uses an advanced inference model to assist developers in code comprehension, refactoring, and quality detection, proving effective and practical through evaluation and user studies.

Authors:Qiuhui Chen, Xuancheng Yao, Huping Ye, Yi Hong
Title: Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models
Abstract:
Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to noise introduced by potential noises in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.
中文: Med3DInsight是一种创新的预训练框架,通过平面切片感知变换器和部分最优传输对齐技术,将3D医学图像编码器与2D多模态大语言模型相结合,在无需人工标注的情况下实现了分割和分类任务的最先进性能。
English: Med3DInsight is a novel pretraining framework that integrates 3D medical image encoders with 2D multimodal large language models through a plane-slice-aware transformer and partial optimal transport alignment, achieving state-of-the-art performance in segmentation and classification tasks without human annotations.

Authors:Piyush Pant
Title: Improving LLM Safety and Helpfulness using SFT and DPO: A Study on OPT-350M
Abstract:
This research investigates the effectiveness of alignment techniques, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and a combined SFT+DPO approach on improving the safety and helpfulness of the OPT-350M language model. Utilizing the Anthropic Helpful-Harmless RLHF dataset, we train and evaluate four models: the base OPT350M, an SFT model, a DPO model, and a model trained with both SFT and DPO. We introduce three key evaluation metrics: Harmlessness Rate (HmR), Helpfulness Rate (HpR), and a Combined Alignment Score (CAS), all derived from reward model outputs. The results show that while SFT outperforms DPO, The combined SFT+DPO model outperforms all others across all metrics, demonstrating the complementary nature of these techniques. Our findings also highlight challenges posed by noisy data, limited GPU resources, and training constraints. This study offers a comprehensive view of how fine-tuning strategies affect model alignment and provides a foundation for more robust alignment pipelines in future work.
中文摘要:本研究表明,结合监督微调(SFT)和直接偏好优化(DPO)的方法在提升语言模型安全性和实用性方面效果最佳,优于单独使用任一技术。
English Summary: This study demonstrates that combining Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) yields the best results in enhancing both safety and helpfulness of language models, outperforming either method used individually.

Authors:Umair Hassan
Title: COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
Abstract:
Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
中文摘要:COCO-Urdu数据集通过为5.9万张图像提供31.9万条经过质量验证的乌尔都语标注,解决了乌尔都语多模态资源匮乏的问题,成为最大的公开乌尔都语标注数据集,旨在减少视觉语言研究中的语言偏见。
English Summary: The COCO-Urdu dataset addresses the scarcity of Urdu multimodal resources by providing 319,000 quality-validated Urdu captions for 59,000 images, establishing the largest public Urdu captioning dataset to reduce language bias in vision-language research.

Authors:Marianna Nezhurina, Jörg Franke, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev
Title: Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
Abstract:
We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple model (0.13B to 1.7B parameters) and token scales (up to 1T) on 8 recent open reference datasets. Evaluating the models on various standardized benchmarks, our training runs set establishes reference points that enable researchers to assess the sanity and quality of alternative training approaches across scales and datasets. Intermediate checkpoints allow comparison and studying of the training dynamics. The established reference baselines allow training procedures to be compared through their scaling trends, aligning them on a common compute axis. Comparison of open reference datasets reveals that training on NemoTron-CC HQ consistently outperforms other reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to intermediate training checkpoints, the release includes logs, code, and downstream evaluations to simplify reproduction, standardize comparison, and facilitate future research.
中文: 我们推出了open-sci-ref系列密集Transformer模型,作为跨多尺度和数据集的研究基准,评估显示NemoTron-CC HQ数据集训练效果最佳,并发布了代码和日志以简化复现和促进未来研究。
English: We introduce open-sci-ref, a family of dense transformer models trained as research baselines across multiple scales and datasets, with evaluations showing that training on NemoTron-CC HQ consistently outperforms other datasets, and the release includes code and logs to facilitate reproduction and future research.

Authors:Andrew Bell, Yan Kit Choi, Steffen E Petersen, Andrew King, Muhummad Sohaib Nazir, Alistair A Young
Title: Implicit Neural Representations of Intramyocardial Motion and Strain
Abstract:
Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement -- without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is $\sim$380$\times$ faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets. The code can be found at https://github.com/andrewjackbell/Displacement-INR
中文: 本研究提出了一种基于隐式神经表示的方法,用于从标记MRI中精确量化左心室运动,在英国生物银行数据上实现了卓越的跟踪精度和效率。
English: This study introduces a method using implicit neural representations to accurately quantify left ventricular motion from tagging MRI, achieving superior tracking accuracy and efficiency on UK Biobank data.

Authors:Magdalena Wysocki, Felix Duelmer, Ananya Bal, Nassir Navab, Mohammad Farid Azampour
Title: UltrON: Ultrasound Occupancy Networks
Abstract:
In free-hand ultrasound imaging, sonographers rely on expertise to mentally integrate partial 2D views into 3D anatomical shapes. Shape reconstruction can assist clinicians in this process. Central to this task is the choice of shape representation, as it determines how accurately and efficiently the structure can be visualized, analyzed, and interpreted. Implicit representations, such as SDF and occupancy function, offer a powerful alternative to traditional voxel- or mesh-based methods by modeling continuous, smooth surfaces with compact storage, avoiding explicit discretization. Recent studies demonstrate that SDF can be effectively optimized using annotations derived from segmented B-mode ultrasound images. Yet, these approaches hinge on precise annotations, overlooking the rich acoustic information embedded in B-mode intensity. Moreover, implicit representation approaches struggle with the ultrasound's view-dependent nature and acoustic shadowing artifacts, which impair reconstruction. To address the problems resulting from occlusions and annotation dependency, we propose an occupancy-based representation and introduce \gls{UltrON} that leverages acoustic features to improve geometric consistency in weakly-supervised optimization regime. We show that these features can be obtained from B-mode images without additional annotation cost. Moreover, we propose a novel loss function that compensates for view-dependency in the B-mode images and facilitates occupancy optimization from multiview ultrasound. By incorporating acoustic properties, \gls{UltrON} generalizes to shapes of the same anatomy. We show that \gls{UltrON} mitigates the limitations of occlusions and sparse labeling and paves the way for more accurate 3D reconstruction. Code and dataset will be available at https://github.com/magdalena-wysocki/ultron.
在自由手超声成像中,提出的UltrON方法利用声学特征和基于占据率的表示,通过解决遮挡和标注依赖性问题来增强三维形状重建,无需额外标注成本。
In free-hand ultrasound imaging, the proposed UltrON method uses acoustic features and an occupancy-based representation to enhance 3D shape reconstruction by addressing occlusion and annotation dependency issues without extra labeling costs.

Authors:Nima Karimian Kakolaki
Title: A Comparative Analysis of Identifier Schemes: UUIDv4, UUIDv7, and ULID for Distributed Systems
Abstract:
Distributed systems require robust, scalable identifier schemes to ensure data uniqueness and efficient indexing across multiple nodes. This paper presents a comprehensive analysis of the evolution of distributed identifiers, comparing traditional auto-increment keys with UUIDv4, UUIDv7, and ULIDs. We combine mathematical calculation of collision probabilities with empirical experiments measuring generation speed and network transmission overhead in a simulated distributed environment. Results demonstrate that ULIDs significantly outperform UUIDv4 and UUIDv7, reducing network overhead by 83.7% and increasing generation speed by 97.32%. statistical analysis further shows ULIDs offer a 98.42% lower collision risk compared to UUIDv7, while maintaining negligible collision probabilities even at high generation rates. These findings highlight ULIDs as an optimal choice for high-performance distributed systems, providing efficient, time-ordered, and lexicographically sortable identifiers suitable for scalable applications. All source code, datasets, and analysis scripts utilized in this research are publicly available in our dedicated repository at https://github.com/nimakarimiank/uids-comparison. This repository contains comprehensive documentation of the experimental setup, including configuration files for the distributed environment, producer and consumer implementations, and message broker integration. Additionally, it provides the data scripts and datasets. Researchers and practitioners are encouraged to explore the repository for full reproducibility of the experiments and to facilitate further investigation or extension of the presented work.
中文: 本研究表明,在分布式系统中,ULID 通过显著降低网络开销、提高生成速度并减少碰撞风险,明显优于 UUIDv4 和 UUIDv7,是高性能应用的最佳选择。
English: This study demonstrates that ULIDs significantly outperform UUIDv4 and UUIDv7 in distributed systems by reducing network overhead, increasing generation speed, and lowering collision risks, making them the optimal choice for high-performance applications.

Authors:Puskal Khadka, Rodrigue Rizk, Longwei Wang, KC Santosh
Title: CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision
Abstract:
Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin
中文:CoSwin通过将局部卷积特征与分层注意力相结合来增强视觉Transformer,在小型图像分类基准上凭借有效的局部-全局特征融合实现了更优性能。
English: CoSwin enhances Vision Transformers by integrating localized convolutional features with hierarchical attention, achieving superior performance on small-scale image classification benchmarks through effective local-global feature fusion.

Authors:Lisa Dunlap, Joseph E. Gonzalez, Trevor Darrell, Fabian Caba Heilbron, Josef Sivic, Bryan Russell
Title: Discovering Divergent Representations between Text-to-Image Models
Abstract:
In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, "flames" might appear in one model's outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model's output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon's ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon
中文摘要:本文提出CompCon进化算法,用于发现不同文本到图像模型输出中视觉属性的差异及其触发提示,并通过对比PixArt和Stable Diffusion 3.5等流行模型验证了其有效性。
English Summary: This paper introduces CompCon, an evolutionary algorithm that identifies visual attributes more prevalent in one text-to-image model's outputs than another's and reveals the prompt concepts causing these differences, demonstrated through comparisons of popular models like PixArt and Stable Diffusion 3.5.

Authors:Rogerio Guimaraes, Frank Xiao, Pietro Perona, Markus Marks
Title: Diffusion-Based Action Recognition Generalizes to Untrained Domains
Abstract:
Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff. Code: https://github.com/frankyaoxiao/ActionDiff
中文: 本文提出了一种利用视觉扩散模型特征并通过变换器聚合的新方法,实现了跨物种、视角和场景的人类水平动作识别,在泛化基准测试中创下了最新最优性能。
English: This paper introduces a novel method using Vision Diffusion Model features aggregated by a transformer to achieve human-like action recognition across species, viewpoints, and contexts, setting new state-of-the-art results in generalization benchmarks.

Authors:Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Title: Recurrence Meets Transformers for Universal Multimodal Retrieval
Abstract:
With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
中文: ReT-2是一种统一的多模态检索模型,采用带门控机制的循环Transformer动态整合跨模态信息,在多个基准测试中实现最优性能,同时提升效率并改善下游任务表现。
English: ReT-2 is a unified multimodal retrieval model that employs a recurrent Transformer with gating mechanisms to dynamically integrate cross-modal information, achieving state-of-the-art performance across benchmarks while enhancing efficiency and downstream task results.

Authors:Wenqi Marshall Guo, Yiyang Du, Heidi J. S. Tworek, Shan Du
Title: Position: The Pitfalls of Over-Alignment: Overly Caution Health-Related Responses From LLMs are Unethical and Dangerous
Abstract:
Large Language Models (LLMs) are usually aligned with "human values/preferences" to prevent harmful output. Discussions around the alignment of Large Language Models (LLMs) generally focus on preventing harmful outputs. However, in this paper, we argue that in health-related queries, over-alignment-leading to overly cautious responses-can itself be harmful, especially for people with anxiety and obsessive-compulsive disorder (OCD). This is not only unethical but also dangerous to the user, both mentally and physically. We also showed qualitative results that some LLMs exhibit varying degrees of alignment. Finally, we call for the development of LLMs with stronger reasoning capabilities that provide more tailored and nuanced responses to health queries. Warning: This paper contains materials that could trigger health anxiety or OCD. Dataset and full results can be found in https://github.com/weathon/over-alignment.
中文摘要:本文指出大语言模型的过度对齐可能导致对健康查询的过度谨慎回应,这对焦虑症和强迫症患者尤其有害,并呼吁开发具有更强推理能力的模型以提供更细致的回答。
English Summary: This paper argues that over-alignment in LLMs can cause harmful, overly cautious responses to health queries, particularly for individuals with anxiety and OCD, and advocates for models with stronger reasoning to deliver more nuanced answers.

Authors:David Stotko, Reinhard Klein
Title: SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video
Abstract:
The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction by addressing the depth ambiguity problem in monocular video. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of 2.64 while requiring a medium runtime of 30 min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.
中文: 本文提出了一种新颖方法,仅使用单目RGB视频即可实现织物的三维动态场景重建,通过物理模拟和可微分渲染将几何重建与外观估计相结合,从而获得高质量结果。
English: This paper introduces a novel method for 3D dynamic scene reconstruction of fabrics using a single monocular RGB video, combining geometry reconstruction with appearance estimation through physical simulation and differentiable rendering to achieve high-quality results.

Authors:Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou
Title: A Survey of Reinforcement Learning for Large Reasoning Models
Abstract:
In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
中文: 本文综述了强化学习在增强大语言模型推理能力方面的最新进展,探讨了实现人工超智能所面临的挑战与未来发展方向。
English: This paper surveys recent advances in using Reinforcement Learning to enhance reasoning capabilities in Large Language Models, examining challenges and future directions toward achieving Artificial SuperIntelligence.

Authors:Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou
Title: A Survey of Reinforcement Learning for Large Reasoning Models
Abstract:
In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into LRMs. With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. Github: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
中文: 本文综述了强化学习在增强大语言模型推理能力方面的最新进展,探讨了实现人工超智能所面临的挑战与未来发展方向。
English: This paper surveys recent advances in using Reinforcement Learning to enhance reasoning capabilities in Large Language Models, examining challenges and future directions toward achieving Artificial SuperIntelligence.

Authors:Hailay Kidu Teklehaymanot, Dren Fazlija, Wolfgang Nejdl
Title: MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
Abstract:
Subword-based tokenization methods often fail to preserve morphological boundaries, a limitation especially pronounced in low-resource, morphologically complex languages such as those written in the Geez script. To address this, we present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into the subword vocabulary. This hybrid segmentation approach combines morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological integrity while maintaining lexical meaning. To tackle resource scarcity, we curate and release manually annotated morpheme data for four Geez script languages and a morpheme-aware vocabulary for two of them. While the proposed tokenization method does not lead to significant gains in automatic translation quality, we observe consistent improvements in intrinsic metrics, MorphoScore, and Boundary Precision, highlighting the value of morphology-aware segmentation in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated datasets and tokenizer will be publicly available to support further research in low-resource, morphologically rich languages. Our code and data are available on GitHub: https://github.com/hailaykidu/MoVoC
中文:MoVoC分词器将形态学分析与子词分割相结合,以保持格厄兹文字语言的词法结构,尽管在翻译质量上提升有限,但在形态学评估指标上展现出持续改进。
English: The MoVoC tokenizer integrates morphological analysis with subword segmentation to preserve linguistic structure in Geez script languages, demonstrating improved morphological metrics despite limited translation gains.

Authors:Tristan Montoya, Andrés M. Rueda-Ramírez, Gregor J. Gassner
Title: Entropy-Stable Discontinuous Spectral-Element Methods for the Spherical Shallow Water Equations in Covariant Form
Abstract:
We introduce discontinuous spectral-element methods of arbitrary order that are well balanced, conservative of mass, and conservative or dissipative of total energy (i.e., a mathematical entropy function) for a covariant flux formulation of the rotating shallow water equations with variable bottom topography on curved manifolds such as the sphere. The proposed methods are based on a skew-symmetric splitting of the tensor divergence in covariant form, which we implement and analyze within a general flux-differencing framework using tensor-product summation-by-parts operators. Such schemes are proven to satisfy semi-discrete mass and energy conservation on general unstructured quadrilateral grids in addition to well balancing for arbitrary continuous bottom topographies, with energy dissipation resulting from a suitable choice of numerical interface flux. Furthermore, the proposed covariant formulation permits an analytical representation of the geometry and associated metric terms while satisfying the aforementioned entropy stability, conservation, and well-balancing properties without the need to approximate the metric terms so as to enforce discrete metric identities. Numerical experiments on cubed-sphere grids are presented in order to verify the schemes' structure-preservation properties as well as to assess their accuracy and robustness within the context of several standard test cases characteristic of idealized atmospheric flows. Our theoretical and numerical results support the further development of the proposed methodology towards a full dynamical core for numerical weather prediction and climate modelling, as well as broader applications to other hyperbolic and advection-dominated systems of partial differential equations on curved manifolds.
中文: 本研究提出了高阶间断谱元法,用于曲面上的旋转浅水方程,确保质量守恒、能量平衡及良好平衡特性,并通过立方球网格数值实验验证了其结构保持性能和精度。
English: This study presents high-order discontinuous spectral-element methods that ensure mass conservation, energy balance, and well-balanced properties for rotating shallow water equations on curved surfaces, validated through numerical experiments on cubed-sphere grids.

Authors:Mikhail Khodak, Min Ki Jung, Brian Wynne, Edmond Chow, Egemen Kolemen
Title: PCGBandit: One-shot acceleration of transient PDE solvers via online-learned preconditioners
Abstract:
Data-driven acceleration of scientific computing workflows has been a high-profile aim of machine learning (ML) for science, with numerical simulation of transient partial differential equations (PDEs) being one of the main applications. The focus thus far has been on methods that require classical simulations to train, which when combined with the data-hungriness and optimization challenges of neural networks has caused difficulties in demonstrating a convincing advantage against strong classical baselines. We consider an alternative paradigm in which the learner uses a classical solver's own data to accelerate it, enabling a one-shot speedup of the simulation. Concretely, since transient PDEs often require solving a sequence of related linear systems, the feedback from repeated calls to a linear solver such as preconditioned conjugate gradient (PCG) can be used by a bandit algorithm to online-learn an adaptive sequence of solver configurations (e.g. preconditioners). The method we develop, PCGBandit, is implemented directly on top of the popular open source software OpenFOAM, which we use to show its effectiveness on a set of fluid and magnetohydrodynamics (MHD) problems.
中文摘要:机器学习通过利用经典求解器自身数据实现一次性加速,PCGBandit方法在OpenFOAM中成功应用于流体和磁流体动力学问题,展示了这种自适应学习范式对科学计算工作流的加速潜力。
English Summary: Machine learning offers a novel approach to accelerate scientific computing by enabling one-shot speedup of numerical simulations through adaptive learning from classical solver data, as demonstrated by the PCGBandit method implemented in OpenFOAM for fluid and MHD problems.

Authors:Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
Title: AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
Abstract:
Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios, and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.
中文: AgentGym-RL框架作为一个统一的强化学习平台,通过ScalingInter-RL训练方法在多样化环境中从头训练自主LLM智能体,在平衡探索与利用的同时,在多项任务中展现出卓越性能。
English: The AgentGym-RL framework is introduced as a unified reinforcement learning platform that trains autonomous LLM agents from scratch across diverse environments, incorporating the ScalingInter-RL approach to balance exploration and exploitation while demonstrating superior performance on multiple tasks.

Authors:Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez
Title: Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
Abstract:
We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step,and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrary long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
中文: DSM是一种新颖的流式序列到序列方法,通过预对齐多模态流并引入延迟,仅使用解码器模型即可在ASR和TTS等任务中实现最优性能和低延迟。
English: DSM is a novel streaming sequence-to-sequence approach that uses delayed, time-aligned multimodal streams with a decoder-only model to achieve state-of-the-art performance and low latency across tasks like ASR and TTS.

Authors:Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez
Title: Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
Abstract:
We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence rely on learning a policy for choosing when to advance on the input stream, or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step,and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences, from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrary long sequences, being even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
中文: DSM是一种新颖的流式序列到序列方法,通过预对齐多模态流并引入延迟,仅使用解码器模型即可在ASR和TTS等任务中实现最优性能和低延迟。
English: DSM is a novel streaming sequence-to-sequence approach that uses delayed, time-aligned multimodal streams with a decoder-only model to achieve state-of-the-art performance and low latency across tasks like ASR and TTS.

Authors:Marius Dähling, Sebastian Krebs, J. Marius Zöllner
Title: CrowdQuery: Density-Guided Query Module for Enhanced 2D and 3D Detection in Crowded Scenes
Abstract:
This paper introduces a novel method for end-to-end crowd detection that leverages object density information to enhance existing transformer-based detectors. We present CrowdQuery (CQ), whose core component is our CQ module that predicts and subsequently embeds an object density map. The embedded density information is then systematically integrated into the decoder. Existing density map definitions typically depend on head positions or object-based spatial statistics. Our method extends these definitions to include individual bounding box dimensions. By incorporating density information into object queries, our method utilizes density-guided queries to improve detection in crowded scenes. CQ is universally applicable to both 2D and 3D detection without requiring additional data. Consequently, we are the first to design a method that effectively bridges 2D and 3D detection in crowded environments. We demonstrate the integration of CQ into both a general 2D and 3D transformer-based object detector, introducing the architectures CQ2D and CQ3D. CQ is not limited to the specific transformer models we selected. Experiments on the STCrowd dataset for both 2D and 3D domains show significant performance improvements compared to the base models, outperforming most state-of-the-art methods. When integrated into a state-of-the-art crowd detector, CQ can further improve performance on the challenging CrowdHuman dataset, demonstrating its generalizability. The code is released at https://github.com/mdaehl/CrowdQuery.
中文: 本文提出CrowdQuery方法,通过将包含边界框尺寸的物体密度图嵌入对象查询中,有效提升了基于Transformer的检测器在拥挤场景中的2D和3D检测性能,并在多个数据集上实现显著改进。
English: This paper presents CrowdQuery, a novel method that enhances transformer-based detectors by integrating object density maps with bounding box dimensions into object queries, significantly improving 2D and 3D crowd detection performance across multiple datasets.

Authors:Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
Title: X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
Abstract:
Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $θ= 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
中文: X-Teaming Evolutionary M2S通过语言模型引导的进化自动发现并优化多轮转单轮模板,在GPT-4.1上实现44.8%的成功率,证明结构改进可跨模型迁移,同时强调阈值校准与跨模型评估的重要性。
English: X-Teaming Evolutionary M2S automates the discovery and optimization of multi-turn-to-single-turn templates through language-model-guided evolution, achieving 44.8% success on GPT-4.1 and demonstrating that structural improvements transfer across models while highlighting the need for threshold calibration and cross-model evaluation.

Authors:Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
Title: X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
Abstract:
Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one structured prompt, but prior work relied on a handful of manually written templates. We present X-Teaming Evolutionary M2S, an automated framework that discovers and optimizes M2S templates through language-model-guided evolution. The system pairs smart sampling from 12 sources with an LLM-as-judge inspired by StrongREJECT and records fully auditable logs. Maintaining selection pressure by setting the success threshold to $θ= 0.70$, we obtain five evolutionary generations, two new template families, and 44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of 2,500 trials (judge fixed) shows that structural gains transfer but vary by target; two models score zero at the same threshold. We also find a positive coupling between prompt length and score, motivating length-aware judging. Our results demonstrate that structure-level search is a reproducible route to stronger single-turn probes and underscore the importance of threshold calibration and cross-model evaluation. Code, configurations, and artifacts are available at https://github.com/hyunjun1121/M2S-x-teaming.
中文: X-Teaming Evolutionary M2S通过语言模型引导的进化自动发现并优化多轮转单轮模板,在GPT-4.1上实现44.8%的成功率,证明结构改进可跨模型迁移,同时强调阈值校准与跨模型评估的重要性。
English: X-Teaming Evolutionary M2S automates the discovery and optimization of multi-turn-to-single-turn templates through language-model-guided evolution, achieving 44.8% success on GPT-4.1 and demonstrating that structural improvements transfer across models while highlighting the need for threshold calibration and cross-model evaluation.

Authors:Sike Xiang, Shuang Chen, Amir Atapour-Abarghouei
Title: BcQLM: Efficient Vision-Language Understanding with Distilled Q-Gated Cross-Modal Fusion
Abstract:
As multimodal large language models (MLLMs) advance, their large-scale architectures pose challenges for deployment in resource-constrained environments. In the age of large models, where energy efficiency, computational scalability and environmental sustainability are paramount, the development of lightweight and high-performance models is critical for real-world applications. As such, we propose a lightweight MLLM framework for end-to-end visual question answering. Our proposed approach centres on BreezeCLIP, a compact yet powerful vision-language encoder optimised for efficient multimodal understanding. With only 1.2 billion parameters overall, our model significantly reduces computational cost while achieving performance comparable to standard-size MLLMs. Experiments conducted on multiple datasets further validate its effectiveness in balancing accuracy and efficiency. The modular and extensible design enables generalisation to broader multimodal tasks. The proposed lightweight vision-language framework is denoted as BcQLM (BreezeCLIP-enhanced Q-Gated Multimodal Language Model). It offers a promising path toward deployable MLLMs under practical hardware constraints. The source code is available at https://github.com/thico0224/BcQLM.
中文: 提出的轻量级BcQLM框架通过紧凑型BreezeCLIP编码器,仅用12亿参数就实现了与标准多模态模型相当的性能,为资源受限环境提供了高效的部署方案。
English: The proposed lightweight BcQLM framework with its compact BreezeCLIP encoder achieves performance comparable to standard multimodal models while using only 1.2 billion parameters, offering an efficient solution for deployment in resource-constrained environments.

Authors:Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer
Title: Tokenizing Loops of Antibodies
Abstract:
The complementarity-determining regions of antibodies are loop structures that are key to their interactions with antigens, and of high importance to the design of novel biologics. Since the 1980s, categorizing the diversity of CDR structures into canonical clusters has enabled the identification of key structural motifs of antibodies. However, existing approaches have limited coverage and cannot be readily incorporated into protein foundation models. Here we introduce ImmunoGlobulin LOOp Tokenizer, Igloo, a multimodal antibody loop tokenizer that encodes backbone dihedral angles and sequence. Igloo is trained using a contrastive learning objective to map loops with similar backbone dihedral angles closer together in latent space. Igloo can efficiently retrieve the closest matching loop structures from a structural antibody database, outperforming existing methods on identifying similar H3 loops by 5.9\%. Igloo assigns tokens to all loops, addressing the limited coverage issue of canonical clusters, while retaining the ability to recover canonical loop conformations. To demonstrate the versatility of Igloo tokens, we show that they can be incorporated into protein language models with IglooLM and IglooALM. On predicting binding affinity of heavy chain variants, IglooLM outperforms the base protein language model on 8 out of 10 antibody-antigen targets. Additionally, it is on par with existing state-of-the-art sequence-based and multimodal protein language models, performing comparably to models with $7\times$ more parameters. IglooALM samples antibody loops which are diverse in sequence and more consistent in structure than state-of-the-art antibody inverse folding models. Igloo demonstrates the benefit of introducing multimodal tokens for antibody loops for encoding the diverse landscape of antibody loops, improving protein foundation models, and for antibody CDR design.
中文: Igloo是一种多模态抗体环区标记器,能更有效地识别和构建抗体环区结构,提升蛋白质语言模型的性能并优化CDR设计,超越了现有方法的局限性。
English: Igloo is a multimodal antibody loop tokenizer that improves the identification and structural consistency of antibody loops, enhancing protein language models and CDR design beyond traditional methods.

Authors:Stefan Podgorski, Sourav Garg, Mehdi Hosseinzadeh, Lachlan Mares, Feras Dayoub, Ian Reid
Title: TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals
Abstract:
Visual navigation in robotics traditionally relies on globally-consistent 3D maps or learned controllers, which can be computationally expensive and difficult to generalize across diverse environments. In this work, we present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation without requiring 3D maps or pre-trained controllers. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We address key limitations of previous methods by continuously predicting local trajectory using monocular depth and traversability estimation, and incorporating an auto-switching mechanism that falls back to a baseline controller when necessary. The system operates using foundational models, ensuring open-set applicability without the need for domain-specific fine-tuning. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability. Our approach outperforms existing state-of-the-art methods, offering a more adaptable and effective solution for visual navigation in open-set environments. The source code is made publicly available: https://github.com/podgorki/TANGO.
中文摘要:本研究提出了一种仅使用RGB图像的物体级拓扑导航系统,无需3D地图或预训练控制器即可实现零样本长距离机器人导航,通过全局路径规划与局部轨迹控制的结合,在开放环境中展现出优于现有方法的适应性和有效性。
English Summary: This study introduces a novel RGB-only, object-level topometric navigation system that enables zero-shot, long-range robot navigation without relying on 3D maps or pre-trained controllers, outperforming existing methods through integrated global planning and local control with open-set applicability.

Authors:Zhen Tian, Christos Anagnostopoulos, Qiyuan Wang, Zhiwei Gao
Title: Multi-Modal Robust Enhancement for Coastal Water Segmentation: A Systematic HSV-Guided Framework
Abstract:
Coastal water segmentation from satellite imagery presents unique challenges due to complex spectral characteristics and irregular boundary patterns. Traditional RGB-based approaches often suffer from training instability and poor generalization in diverse maritime environments. This paper introduces a systematic robust enhancement framework, referred to as Robust U-Net, that leverages HSV color space supervision and multi-modal constraints for improved coastal water segmentation. Our approach integrates five synergistic components: HSV-guided color supervision, gradient-based coastline optimization, morphological post-processing, sea area cleanup, and connectivity control. Through comprehensive ablation studies, we demonstrate that HSV supervision provides the highest impact (0.85 influence score), while the complete framework achieves superior training stability (84\% variance reduction) and enhanced segmentation quality. Our method shows consistent improvements across multiple evaluation metrics while maintaining computational efficiency. For reproducibility, our training configurations and code are available here: https://github.com/UofgCoastline/ICASSP-2026-Robust-Unet.
中文: 本文提出Robust U-Net框架,通过HSV颜色监督和多模态约束改进了卫星图像中的海岸线水域分割,实现了更好的训练稳定性和分割质量。
English: This paper introduces Robust U-Net, a framework that enhances coastal water segmentation in satellite imagery through HSV color supervision and multi-modal constraints, achieving improved training stability and segmentation quality.

Authors:Amirali Rayegan, Tim Menzies
Title: Minimal Data, Maximum Clarity: A Heuristic for Explaining Optimization
Abstract:
Efficient, interpretable optimization is a critical but underexplored challenge in software engineering, where practitioners routinely face vast configuration spaces and costly, error-prone labeling processes. This paper introduces EZR, a novel and modular framework for multi-objective optimization that unifies active sampling, learning, and explanation within a single, lightweight pipeline. Departing from conventional wisdom, our Maximum Clarity Heuristic demonstrates that using less (but more informative) data can yield optimization models that are both effective and deeply understandable. EZR employs an active learning strategy based on Naive Bayes sampling to efficiently identify high-quality configurations with a fraction of the labels required by fully supervised approaches. It then distills optimization logic into concise decision trees, offering transparent, actionable explanations for both global and local decision-making. Extensive experiments across 60 real-world datasets establish that EZR reliably achieves over 90% of the best-known optimization performance in most cases, while providing clear, cohort-based rationales that surpass standard attribution-based explainable AI (XAI) methods (LIME, SHAP, BreakDown) in clarity and utility. These results endorse "less but better"; it is both possible and often preferable to use fewer (but more informative) examples to generate label-efficient optimization and explanations in software systems. To support transparency and reproducibility, all code and experimental materials are publicly available at https://github.com/amiiralii/Minimal-Data-Maximum-Clarity.
中文摘要:EZR框架通过主动学习和决策树解释,利用少量但信息丰富的数据实现高效的多目标优化,在保持高性能的同时提供了超越传统方法的可解释性。
English Summary: EZR is a modular framework that achieves efficient multi-objective optimization using minimal but informative data through active learning and decision tree explanations, outperforming standard methods while providing superior interpretability.

Authors:Fanzhen Liu, Alsharif Abuadbba, Kristen Moore, Surya Nepal, Cecile Paris, Jia Wu, Jian Yang, Quan Z. Sheng
Title: Adversarial Attacks Against Automated Fact-Checking: A Survey
Abstract:
In an era where misinformation spreads freely, fact-checking (FC) plays a crucial role in verifying claims and promoting reliable information. While automated fact-checking (AFC) has advanced significantly, existing systems remain vulnerable to adversarial attacks that manipulate or generate claims, evidence, or claim-evidence pairs. These attacks can distort the truth, mislead decision-makers, and ultimately undermine the reliability of FC models. Despite growing research interest in adversarial attacks against AFC systems, a comprehensive, holistic overview of key challenges remains lacking. These challenges include understanding attack strategies, assessing the resilience of current models, and identifying ways to enhance robustness. This survey provides the first in-depth review of adversarial attacks targeting FC, categorizing existing attack methodologies and evaluating their impact on AFC systems. Additionally, we examine recent advancements in adversary-aware defenses and highlight open research questions that require further exploration. Our findings underscore the urgent need for resilient FC frameworks capable of withstanding adversarial manipulations in pursuit of preserving high verification accuracy.
中文摘要:本综述首次系统梳理针对事实核查系统的对抗性攻击,分类评估攻击方法及防御机制,强调构建抗干扰核查框架对保障信息验证准确性的紧迫需求。
English Summary: This survey comprehensively reviews adversarial attacks on automated fact-checking systems, analyzing attack methodologies and defenses while highlighting the critical need for more resilient frameworks to maintain verification accuracy.

Authors:Yujian Ma, Jinqiu Sang, Ruizhe Li
Title: Behind the Scenes: Mechanistic Interpretability of LoRA-adapted Whisper for Speech Emotion Recognition
Abstract:
Large pre-trained speech models such as Whisper offer strong generalization but pose significant challenges for resource-efficient adaptation. Low-Rank Adaptation (LoRA) has become a popular parameter-efficient fine-tuning method, yet its underlying mechanisms in speech tasks remain poorly understood. In this work, we conduct the first systematic mechanistic interpretability study of LoRA within the Whisper encoder for speech emotion recognition (SER). Using a suite of analytical tools, including layer contribution probing, logit-lens inspection, and representational similarity via singular value decomposition (SVD) and centered kernel alignment (CKA), we reveal two key mechanisms: a delayed specialization process that preserves general features in early layers before consolidating task-specific information, and a forward alignment, backward differentiation dynamic between LoRA's matrices. Our findings clarify how LoRA reshapes encoder hierarchies, providing both empirical insights and a deeper mechanistic understanding for designing efficient and interpretable adaptation strategies in large speech models. Our code is available at https://github.com/harryporry77/Behind-the-Scenes.
Chinese: 本研究首次对Whisper编码器中低秩适应(LoRA)机制进行系统性解析,揭示了其在语音情感识别任务中先保留通用特征再实现任务专化的延迟特化过程,以及前向对齐与反向微分的矩阵动态如何重构编码器层次结构。
English: This study provides the first mechanistic analysis of Low-Rank Adaptation (LoRA) in Whisper's encoder for speech emotion recognition, revealing how it preserves general features before specializing and operates through forward-backward matrix dynamics to reshape model hierarchies.

Authors:Jinzhong Ning, Paerhati Tulajiang, Yingying Le, Yijia Zhang, Yuanyuan Sun, Hongfei Lin, Haifeng Liu
Title: CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
Abstract:
Speech Relation Extraction (SpeechRE) aims to extract relation triplets directly from speech. However, existing benchmark datasets rely heavily on synthetic data, lacking sufficient quantity and diversity of real human speech. Moreover, existing models also suffer from rigid single-order generation templates and weak semantic alignment, substantially limiting their performance. To address these challenges, we introduce CommonVoice-SpeechRE, a large-scale dataset comprising nearly 20,000 real-human speech samples from diverse speakers, establishing a new benchmark for SpeechRE research. Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet generation ensemble strategy, leveraging data diversity through diverse element orders during both training and inference, and (2) CNN-based latent relation prediction heads that generate explicit relation prompts to guide cross-modal alignment and accurate triplet generation. Experiments show our approach outperforms state-of-the-art methods, providing both a benchmark dataset and an effective solution for real-world SpeechRE. The source code and dataset are publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.
中文摘要:本文提出了大规模真实语音数据集CommonVoice-SpeechRE和创新框架RPG-MoGe,通过多序生成策略和关系提示机制,有效解决了语音关系抽取中数据不足和语义对齐问题,显著提升了性能。
English Summary: This paper introduces CommonVoice-SpeechRE, a large-scale real-human speech dataset, and proposes the RPG-MoGe framework with multi-order generation and relation prompts to significantly improve speech relation extraction performance.

Authors:Parastoo Pashmchi, Jerome Benoit, Motonobu Kanagawa
Title: kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
Abstract:
We study a missing-value imputation method, termed kNNSampler, that imputes a given unit's missing response by randomly sampling from the observed responses of the $k$ most similar units to the given unit in terms of the observed covariates. This method can sample unknown missing values from their distributions, quantify the uncertainties of missing values, and be readily used for multiple imputation. Unlike popular kNNImputer, which estimates the conditional mean of a missing response given an observed covariate, kNNSampler is theoretically shown to estimate the conditional distribution of a missing response given an observed covariate. Experiments demonstrate its effectiveness in recovering the distribution of missing values. The code for kNNSampler is made publicly available (https://github.com/SAP/knn-sampler).
Chinese: kNNSampler方法通过从k个最相似单元的观测响应中随机抽样来填补缺失值,能够估计条件分布并量化不确定性,实验证明其在恢复缺失值分布方面具有良好效果。
English: The kNNSampler method imputes missing values by randomly sampling from the k most similar units' observed responses, enabling estimation of conditional distributions and uncertainty quantification, with experiments confirming its effectiveness in recovering missing value distributions.

Authors:Rongsheng Wang, Fenghe Tang, Qingsong Yao, Rui Yan, Xu Zhang, Zhen Huang, Haoran Lai, Zhiyang He, Xiaodong Tao, Zihang Jiang, Shaohua Kevin Zhou
Title: SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training
Abstract:
Medical vision-language pre-training shows great potential in learning representative features from massive paired radiographs and reports. However, in computed tomography (CT) scans, the distribution of lesions which contain intricate structures is characterized by spatial sparsity. Besides, the complex and implicit relationships between different pathological descriptions in each sentence of the report and their corresponding sub-regions in radiographs pose additional challenges. In this paper, we propose a Similarity-Driven Cross-Granularity Pre-training (SimCroP) framework on chest CTs, which combines similarity-driven alignment and cross-granularity fusion to improve radiograph interpretation. We first leverage multi-modal masked modeling to optimize the encoder for understanding precise low-level semantics from radiographs. Then, similarity-driven alignment is designed to pre-train the encoder to adaptively select and align the correct patches corresponding to each sentence in reports. The cross-granularity fusion module integrates multimodal information across instance level and word-patch level, which helps the model better capture key pathology structures in sparse radiographs, resulting in improved performance for multi-scale downstream tasks. SimCroP is pre-trained on a large-scale paired CT-reports dataset and validated on image classification and segmentation tasks across five public datasets. Experimental results demonstrate that SimCroP outperforms both cutting-edge medical self-supervised learning methods and medical vision-language pre-training methods. Codes and models are available at https://github.com/ToniChopp/SimCroP.
中文: SimCroP框架通过相似性驱动对齐和跨粒度融合,提升了胸部CT扫描的医学视觉语言预训练能力,能更有效地解析稀疏病灶和复杂报告关系,在下游任务中表现卓越。
English: The SimCroP framework enhances medical vision-language pre-training for chest CT scans by combining similarity-driven alignment and cross-granularity fusion to better interpret sparse lesions and complex report relationships, achieving superior performance in downstream tasks.

Authors:Yuelin Guo, Haoyu He, Zhiyuan Chen, Zitong Huang, Renhao Lu, Lu Shi, Zejun Wang, Weizhe Zhang
Title: Dual-Thresholding Heatmaps to Cluster Proposals for Weakly Supervised Object Detection
Abstract:
Weakly supervised object detection (WSOD) has attracted significant attention in recent years, as it does not require box-level annotations. State-of-the-art methods generally adopt a multi-module network, which employs WSDDN as the multiple instance detection network module and multiple instance refinement modules to refine performance. However, these approaches suffer from three key limitations. First, existing methods tend to generate pseudo GT boxes that either focus only on discriminative parts, failing to capture the whole object, or cover the entire object but fail to distinguish between adjacent intra-class instances. Second, the foundational WSDDN architecture lacks a crucial background class representation for each proposal and exhibits a large semantic gap between its branches. Third, prior methods discard ignored proposals during optimization, leading to slow convergence. To address these challenges, we first design a heatmap-guided proposal selector (HGPS) algorithm, which utilizes dual thresholds on heatmaps to pre-select proposals, enabling pseudo GT boxes to both capture the full object extent and distinguish between adjacent intra-class instances. We then present a weakly supervised basic detection network (WSBDN), which augments each proposal with a background class representation and uses heatmaps for pre-supervision to bridge the semantic gap between matrices. At last, we introduce a negative certainty supervision loss on ignored proposals to accelerate convergence. Extensive experiments on the challenging PASCAL VOC 2007 and 2012 datasets demonstrate the effectiveness of our framework. We achieve mAP/mCorLoc scores of 58.5%/81.8% on VOC 2007 and 55.6%/80.5% on VOC 2012, performing favorably against the state-of-the-art WSOD methods. Our code is publicly available at https://github.com/gyl2565309278/DTH-CP.
Chinese: 本文提出了一种新颖的弱监督目标检测框架,通过设计热图引导的候选框选择器、增强背景类别表示的检测网络和负确定性监督损失,有效解决了现有方法的三个关键缺陷,在PASCAL VOC数据集上取得了领先性能。
English: This paper introduces a novel weakly supervised object detection framework that addresses key limitations in existing methods by proposing a heatmap-guided proposal selector, an enhanced detection network with background class representation, and a negative certainty supervision loss, achieving state-of-the-art performance on PASCAL VOC datasets.

Authors:Ziyuan Wang, Bin Cheng, Longxiang Yuan, Zhengfeng Ji
Title: FeynmanDD: Quantum Circuit Analysis with Classical Decision Diagrams
Abstract:
Applications of decision diagrams in quantum circuit analysis have been an active research area. Our work introduces FeynmanDD, a new method utilizing standard and multi-terminal decision diagrams for quantum circuit simulation and equivalence checking. Unlike previous approaches that exploit patterns in quantum states and operators, our method explores useful structures in the path integral formulation, essentially transforming the analysis into a counting problem. The method then employs efficient counting algorithms using decision diagrams as its underlying computational engine. Through comprehensive theoretical analysis and numerical experiments, we demonstrate FeynmanDD's capabilities and limitations in quantum circuit analysis, highlighting the value of this new BDD-based approach.
中文: 本文提出FeynmanDD方法,利用决策图通过路径积分形式将量子电路分析转化为计数问题,为量子电路仿真和等价性检查提供了新的基于BDD的有效途径。
English: This paper introduces FeynmanDD, a novel method that applies decision diagrams to quantum circuit simulation and equivalence checking by transforming analysis into a counting problem through the path integral formulation.

Authors:Yisong Zhang, Ran Cheng, Guoxing Yi, Kay Chen Tan
Title: A Systematic Survey on Large Language Models for Evolutionary Optimization: From Modeling to Solving
Abstract:
Large Language Models (LLMs), with their strong understanding and reasoning capabilities, are increasingly being explored for tackling optimization problems, especially in synergy with evolutionary computation. Despite rapid progress, however, the field still lacks a unified synthesis and a systematic taxonomy. This survey addresses this gap by providing a comprehensive review of recent developments and organizing them within a structured framework. We classify existing research into two main stages: LLMs for optimization modeling and LLMs for optimization solving. The latter is further divided into three paradigms according to the role of LLMs in the optimization workflow: LLMs as stand-alone optimizers, low-level LLMs embedded within optimization algorithms, and high-level LLMs for algorithm selection and generation. For each category, we analyze representative methods, distill technical challenges, and examine their interplay with traditional approaches. We also review interdisciplinary applications spanning the natural sciences, engineering, and machine learning. By contrasting LLM-driven and conventional methods, we highlight key limitations and research gaps, and point toward future directions for developing self-evolving agentic ecosystems for optimization. An up-to-date collection of related literature is maintained at https://github.com/ishmael233/LLM4OPT.
中文: 本综述系统梳理了大语言模型在优化问题中的应用,将其分为建模与求解两大阶段,并按照LLMs在优化流程中的角色细分为三种范式,同时分析了与传统方法的结合及未来研究方向。
English: This survey comprehensively reviews how Large Language Models (LLMs) are applied to optimization problems, categorizing their roles in modeling and solving, and analyzing their integration with evolutionary computation while highlighting future research directions.

Authors:Yisong Zhang, Ran Cheng, Guoxing Yi, Kay Chen Tan
Title: A Systematic Survey on Large Language Models for Evolutionary Optimization: From Modeling to Solving
Abstract:
Large Language Models (LLMs), with their strong understanding and reasoning capabilities, are increasingly being explored for tackling optimization problems, especially in synergy with evolutionary computation. Despite rapid progress, however, the field still lacks a unified synthesis and a systematic taxonomy. This survey addresses this gap by providing a comprehensive review of recent developments and organizing them within a structured framework. We classify existing research into two main stages: LLMs for optimization modeling and LLMs for optimization solving. The latter is further divided into three paradigms according to the role of LLMs in the optimization workflow: LLMs as stand-alone optimizers, low-level LLMs embedded within optimization algorithms, and high-level LLMs for algorithm selection and generation. For each category, we analyze representative methods, distill technical challenges, and examine their interplay with traditional approaches. We also review interdisciplinary applications spanning the natural sciences, engineering, and machine learning. By contrasting LLM-driven and conventional methods, we highlight key limitations and research gaps, and point toward future directions for developing self-evolving agentic ecosystems for optimization. An up-to-date collection of related literature is maintained at https://github.com/ishmael233/LLM4OPT.
中文: 本综述系统梳理了大语言模型在优化问题中的应用,将其分为建模与求解两大阶段,并按照LLMs在优化流程中的角色细分为三种范式,同时分析了与传统方法的结合及未来研究方向。
English: This survey comprehensively reviews how Large Language Models (LLMs) are applied to optimization problems, categorizing their roles in modeling and solving, and analyzing their integration with evolutionary computation while highlighting future research directions.

Authors:Long Gao, Yunhe Zhang, Yan Jiang, Weiying Xie, Yunsong Li
Title: Hyperspectral Mamba for Hyperspectral Object Tracking
Abstract:
Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba), is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves 73.0\% of the AUC score and 96.3\% of the DP@20 score on the HOTC2020 dataset. The code will be released at https://github.com/lgao001/HyMamba.
中文摘要:提出的HyMamba网络通过状态空间模块统一了光谱、跨深度和时间建模,在基准数据集上实现了最先进的性能,推动了高光谱目标跟踪的发展。
English Summary: The proposed HyMamba network advances hyperspectral object tracking by unifying spectral, cross-depth, and temporal modeling through state space modules, achieving state-of-the-art performance on benchmark datasets.

Authors:Seongho Kim, Sejong Ryu, Hyoukjun You, Je Hyeong Hong
Title: GTA-Crime: A Synthetic Dataset and Generation Framework for Fatal Violence Detection with Adversarial Snippet-Level Domain Adaptation
Abstract:
Recent advancements in video anomaly detection (VAD) have enabled identification of various criminal activities in surveillance videos, but detecting fatal incidents such as shootings and stabbings remains difficult due to their rarity and ethical issues in data collection. Recognizing this limitation, we introduce GTA-Crime, a fatal video anomaly dataset and generation framework using Grand Theft Auto 5 (GTA5). Our dataset contains fatal situations such as shootings and stabbings, captured from CCTV multiview perspectives under diverse conditions including action types, weather, time of day, and viewpoints. To address the rarity of such scenarios, we also release a framework for generating these types of videos. Additionally, we propose a snippet-level domain adaptation strategy using Wasserstein adversarial training to bridge the gap between synthetic GTA-Crime features and real-world features like UCF-Crime. Experimental results validate our GTA-Crime dataset and demonstrate that incorporating GTA-Crime with our domain adaptation strategy consistently enhances real world fatal violence detection accuracy. Our dataset and the data generation framework are publicly available at https://github.com/ta-ho/GTA-Crime.
中文: GTA-Crime数据集和生成框架利用《侠盗猎车手5》的合成数据解决了视频中罕见致命事件检测的难题,并通过领域自适应技术提升了在真实场景中的检测准确性。
English: The GTA-Crime dataset and generation framework address the challenge of detecting rare fatal incidents in videos by using synthetic data from Grand Theft Auto 5, enhanced with domain adaptation to improve real-world detection accuracy.

Authors:Jingjing Liu, Yinchao Han, Xianchao Xiu, Jianhua Zhang, Wanquan Liu
Title: Lightweight Deep Unfolding Networks with Enhanced Robustness for Infrared Small Target Detection
Abstract:
Infrared small target detection (ISTD) is one of the key techniques in image processing. Although deep unfolding networks (DUNs) have demonstrated promising performance in ISTD due to their model interpretability and data adaptability, existing methods still face significant challenges in parameter lightweightness and noise robustness. In this regard, we propose a highly lightweight framework based on robust principal component analysis (RPCA) called L-RPCANet. Technically, a hierarchical bottleneck structure is constructed to reduce and increase the channel dimension in the single-channel input infrared image to achieve channel-wise feature refinement, with bottleneck layers designed in each module to extract features. This reduces the number of channels in feature extraction and improves the lightweightness of network parameters. Furthermore, a noise reduction module is embedded to enhance the robustness against complex noise. In addition, squeeze-and-excitation networks (SENets) are leveraged as a channel attention mechanism to focus on the varying importance of different features across channels, thereby achieving excellent performance while maintaining both lightweightness and robustness. Extensive experiments on the ISTD datasets validate the superiority of our proposed method compared with state-of-the-art methods covering RPCANet, DRPCANet, and RPCANet++. The code will be available at https://github.com/xianchaoxiu/L-RPCANet.
Chinese: 本文提出L-RPCANet,一种基于鲁棒主成分分析的高度轻量化框架,通过分层瓶颈结构、降噪模块和通道注意力机制,在红外小目标检测中实现了参数轻量化和噪声鲁棒性的显著提升。
English: This paper introduces L-RPCANet, a highly lightweight and robust framework for infrared small target detection that enhances parameter efficiency and noise resilience through hierarchical bottleneck structures, a noise reduction module, and channel attention mechanisms.

Authors:Paul Curry
Title: The Domain Mixed Unit: A New Neural Arithmetic Layer
Abstract:
The Domain Mixed Unit (DMU) is a new neural arithmetic unit that learns a single parameter gate that mixes between log-space and linear-space representations while performing either addition (DMU add) or subtraction (DMU sub). Two initializations are proposed for the DMU: one covering addition and multiplication, and another covering subtraction and division. The DMU achieves state-of-the-art performance on the NALM Benchmark, a dataset designed to test the ability of neural arithmetic units to generalize arithmetic operations, specifically performing with the highest percentage solved over all seeds on multiplication and division. The DMU will be submitted as a pull request to the open-source NALM benchmark, and its code is available on GitHub at https://github.com/marict/nalm-benchmark
域混合单元(DMU)是一种新型神经算术单元,通过门控机制融合对数空间和线性空间表示,在NALM基准测试中实现了算术泛化的最优性能,其代码已作为开源项目发布。
The Domain Mixed Unit (DMU) is a novel neural arithmetic unit that combines log-space and linear-space representations through a gating mechanism, achieving state-of-the-art performance on the NALM Benchmark for arithmetic generalization and being made available as open-source code.

Authors:Sasan Sharifipour, Constantino Álvarez Casado, Mohammad Sabokrou, Miguel Bordallo López
Title: APML: Adaptive Probabilistic Matching Loss for Robust 3D Point Cloud Reconstruction
Abstract:
Training deep learning models for point cloud prediction tasks such as shape completion and generation depends critically on loss functions that measure discrepancies between predicted and ground-truth point sets. Commonly used functions such as Chamfer Distance (CD), HyperCD, and InfoCD rely on nearest-neighbor assignments, which often induce many-to-one correspondences, leading to point congestion in dense regions and poor coverage in sparse regions. These losses also involve non-differentiable operations due to index selection, which may affect gradient-based optimization. Earth Mover Distance (EMD) enforces one-to-one correspondences and captures structural similarity more effectively, but its cubic computational complexity limits its practical use. We propose the Adaptive Probabilistic Matching Loss (APML), a fully differentiable approximation of one-to-one matching that leverages Sinkhorn iterations on a temperature-scaled similarity matrix derived from pairwise distances. We analytically compute the temperature to guarantee a minimum assignment probability, eliminating manual tuning. APML achieves near-quadratic runtime, comparable to Chamfer-based losses, and avoids non-differentiable operations. When integrated into state-of-the-art architectures (PoinTr, PCN, FoldingNet) on ShapeNet benchmarks and on a spatiotemporal Transformer (CSI2PC) that generates 3D human point clouds from WiFi CSI measurements, APM loss yields faster convergence, superior spatial distribution, especially in low-density regions, and improved or on-par quantitative performance without additional hyperparameter search. The code is available at: https://github.com/apm-loss/apml.
中文摘要:提出的自适应概率匹配损失(APML)通过可微分且计算高效的近似一对一匹配方法,克服了现有点云损失函数的局限性,在多种架构和基准测试中实现了更优性能。
English Summary: The proposed Adaptive Probabilistic Matching Loss (APML) overcomes limitations of existing point cloud loss functions by providing a differentiable, computationally efficient approximation of one-to-one matching, achieving superior performance across various architectures and benchmarks.

Authors:Hyungjin Chung, Hyelin Nam, Jiyeon Kim, Hyojun Go, Byeongjun Park, Junho Kim, Joonseok Lee, Seongsu Ha, Byung-Hoon Kim
Title: Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
Abstract:
Video Large Language Models (VideoLLMs) face a critical bottleneck: increasing the number of input frames to capture fine-grained temporal detail leads to prohibitive computational costs and performance degradation from long context lengths. We introduce Video Parallel Scaling (VPS), an inference-time method that expands a model's perceptual bandwidth without increasing its context window. VPS operates by running multiple parallel inference streams, each processing a unique, disjoint subset of the video's frames. By aggregating the output probabilities from these complementary streams, VPS integrates a richer set of visual information than is possible with a single pass. We theoretically show that this approach effectively contracts the Chinchilla scaling law by leveraging uncorrelated visual evidence, thereby improving performance without additional training. Extensive experiments across various model architectures and scales (2B-32B) on benchmarks such as Video-MME and EventHallusion demonstrate that VPS consistently and significantly improves performance. It scales more favorably than other parallel alternatives (e.g. Self-consistency) and is complementary to other decoding strategies, offering a memory-efficient and robust framework for enhancing the temporal reasoning capabilities of VideoLLMs.
中文: 视频并行扩展(VPS)是一种推理时方法,通过并行处理视频中互不重叠的帧子集并整合输出结果,在不增加计算成本或额外训练的情况下,有效提升了视频大语言模型的时序推理能力。
English: Video Parallel Scaling (VPS) is an inference-time method that enhances VideoLLMs' temporal reasoning by processing disjoint frame subsets in parallel streams and aggregating their outputs, effectively improving performance without increasing computational costs or requiring additional training.

Authors:Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi, Ziwei Liu
Title: 3D and 4D World Modeling: A Survey
Abstract:
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, they overlook the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for ``world models'' has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey
中文总结:本综述首次系统梳理了3D与4D世界建模研究,通过建立明确定义和分类框架,并系统总结专用数据集与评估指标,填补了该领域的研究空白。
English Summary: This survey provides the first comprehensive review of 3D and 4D world modeling, establishing clear definitions and a structured taxonomy while systematically analyzing datasets and evaluation metrics to address current research gaps.

Authors:Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Title: Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Abstract:
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a \textbf{mid-training exploration scaffold}, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.
中文: Parallel-R1是首个通过强化学习实现并行思维的框架,采用渐进式训练课程解决冷启动问题,在复杂数学推理任务中显著提升了模型性能。
English: Parallel-R1 is the first reinforcement learning framework that enables parallel thinking in large language models, using a progressive curriculum to overcome training challenges and significantly improve reasoning accuracy on complex math tasks.

Authors:Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
Title: Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
Abstract:
Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
Chinese: 近期大型多模态模型的进展使Mini-o3系统通过数十步的深度多轮推理,在复杂视觉搜索任务中实现最优性能,解决了现有方法推理模式单一和交互轮次有限的问题。
English: Recent advances in large multimodal models have enabled Mini-o3 to achieve state-of-the-art performance on challenging visual search tasks through deep, multi-turn reasoning spanning tens of steps, addressing limitations of monotonous reasoning and limited interaction turns in existing approaches.

Authors:Boammani Aser Lompo, Marc Haraoui
Title: Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Abstract:
Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.
Chinese: 针对表格图像视觉推理基准的不足,我们推出了Visual-TableQA——通过经济高效的自动生成流程构建的大规模多模态数据集,有效提升了模型对复杂表格数据的推理能力。
English: To address limitations in visual reasoning benchmarks for table images, we introduce Visual-TableQA, a large-scale multimodal dataset generated through a cost-effective, autonomous pipeline that enhances model performance on complex tabular data.

Authors:Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson
Title: Customizing the Inductive Biases of Softmax Attention using Structured Matrices
Abstract:
The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention. Finally, we show that MLR attention has promising results for long-range time-series forecasting.
中文: 本研究提出了基于高秩结构化矩阵(如块张量序列和多级低秩矩阵)的新评分函数,以解决标准注意力机制的信息损失和缺乏距离依赖偏置问题,在高效计算下显著提升了高维回归、语言建模及长程预测任务的性能。
English: This work introduces new scoring functions using high-rank structured matrices like Block Tensor-Train and Multi-Level Low Rank to overcome standard attention's limitations of information loss and lack of distance-dependent bias, demonstrating superior performance in high-dimensional regression, language modeling, and long-range forecasting tasks.

Authors:Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li
Title: One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning
Abstract:
In heterogeneous multi-task decision-making, tasks not only exhibit diverse observation and action spaces but also vary substantially in their underlying complexities. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling a broad and diverse suite of tasks, gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process. First, to mitigate the gradient conflicts, we systematically investigate key architectural designs for extending UniZero. Our investigation identifies a Mixture-of-Experts (MoE) architecture as the most effective approach. We demonstrate, both theoretically and empirically, that this architecture alleviates gradient conflicts by routing task-specific representations to specialized sub-networks. This finding leads to our proposed model, \textit{ScaleZero}. Second, to dynamically allocate model capacity throughout the learning process, we introduce an online Dynamic Parameter Scaling (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Evaluations on a diverse set of standard benchmarks (Atari, DMC, Jericho) demonstrate that ScaleZero, utilizing solely online reinforcement learning with one model, performs on par with specialized single-task agents. With the DPS strategy, it remains competitive while using just 71.5% of the environment interactions. These findings underscore the potential of ScaleZero for effective multi-task planning. Our code is available at \textcolor{magenta}{https://github.com/opendilab/LightZero}.
Chinese: ScaleZero通过采用专家混合架构和动态参数缩放策略,解决了异构多任务决策中的梯度冲突和模型可塑性丧失问题,在减少环境交互的同时实现了与专业单任务智能体相媲美的性能。
English: ScaleZero addresses gradient conflicts and model plasticity loss in heterogeneous multi-task decision-making by employing a Mixture-of-Experts architecture and a Dynamic Parameter Scaling strategy, achieving competitive performance with specialized single-task agents while using fewer environment interactions.

Authors:Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki
Title: Feature Space Analysis by Guided Diffusion Model
Abstract:
One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee that is missed in past studies, our decoder allows us to evidence which of various attributes in an image are encoded into a feature by the DNN, by generating images whose features are in proximity to that feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP's image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs' feature spaces.
中文: 本文提出一种基于引导扩散的解码器,能生成与用户指定特征高度匹配的图像,无需额外训练即可分析不同深度神经网络的特征空间,揭示其编码的图像属性。
English: This paper introduces a decoder using guided diffusion to generate images matching user-specified features, enabling analysis of DNN feature spaces without additional training and revealing encoded image attributes.

Authors:Kimiaki Shirahama, Miki Yanobu, Kaduki Yamashita, Miho Ohsaki
Title: Feature Space Analysis by Guided Diffusion Model
Abstract:
One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee that is missed in past studies, our decoder allows us to evidence which of various image attributes are encoded into the user-specified feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP's image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs' feature spaces.
中文: 本文提出一种基于引导扩散的解码器,能生成与用户指定特征高度匹配的图像,无需额外训练即可分析不同深度神经网络的特征空间,揭示其编码的图像属性。
English: This paper introduces a decoder using guided diffusion to generate images matching user-specified features, enabling analysis of DNN feature spaces without additional training and revealing encoded image attributes.

Authors:Tuo Wang, Adithya Kulkarni, Tyler Cody, Peter A. Beling, Yujun Yan, Dawei Zhou
Title: GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models
Abstract:
Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at https://github.com/ODYSSEYWT/GUQ.
Chinese: GENUINE提出了一种基于图增强的大语言模型不确定性估计框架,通过依赖解析树和分层池化建模语义关系,相比现有方法将AUROC提升高达29%,并降低超过15%的校准误差。
English: GENUINE introduces a graph-enhanced uncertainty estimation framework for LLMs that leverages dependency parse trees and hierarchical pooling to model semantic relationships, achieving up to 29% higher AUROC and reducing calibration errors by over 15% compared to existing methods.

Authors:Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye
Title: HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?
Abstract:
Recently, the physical capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic and up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions, and covering mixed modalities that encompass problems spanning text-only to diagram-based. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that: across 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with multiple golds; closed-source reasoning MLLMs can achieve 6 to 12 gold medals; and most models still have a significant gap from full marks. These results highlight the performance gap between open-source models and top students, the strong reasoning abilities of closed-source models, and the remaining room for improvement. HiPhO, a human-aligned Olympiad benchmark for multimodal physical reasoning, is open-source at https://github.com/SciYu/HiPhO with a public leaderboard at https://phyarena.github.io/.
中文: HiPhO推出了首个针对高中物理奥林匹克竞赛的基准测试,具备全面数据、专业人工对齐评估及模型与人类表现直接对比功能,揭示了开源模型与顶尖学生的显著差距,同时凸显了闭源模型强大的推理能力。
English: HiPhO introduces the first benchmark for high school physics Olympiads, featuring comprehensive data, professional human-aligned evaluation, and direct model-to-human performance comparisons, revealing significant gaps between open-source models and top students while highlighting closed-source models' strong reasoning capabilities.

Authors:Decheng Duan, Yingyi Zhang, Jitong Peng, Chengzhi Zhang
Title: SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP
Abstract:
Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP--a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at: https://github.com/AKADDC/SciNLP.
中文摘要:SciNLP推出了首个针对自然语言处理领域的全文本实体与关系抽取专用基准数据集,包含60篇人工标注文献,显著提升了知识图谱构建与下游应用的性能表现。
English Summary: SciNLP introduces the first comprehensive benchmark for full-text entity and relation extraction in NLP research, featuring 60 annotated publications that enable significant performance improvements in knowledge graph construction and downstream applications.

Authors:Maja Schlereth, Moritz Schillinger, Katharina Breininger
Title: Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss
Abstract:
Acquiring images in high resolution is often a challenging task. Especially in the medical sector, image quality has to be balanced with acquisition time and patient comfort. To strike a compromise between scan time and quality for Magnetic Resonance (MR) imaging, two anisotropic scans with different low-resolution (LR) orientations can be acquired. Typically, LR scans are analyzed individually by radiologists, which is time consuming and can lead to inaccurate interpretation. To tackle this, we propose a novel approach for fusing two orthogonal anisotropic LR MR images to reconstruct anatomical details in a unified representation. Our multi-view neural network is trained in a self-supervised manner, without requiring corresponding high-resolution (HR) data. To optimize the model, we introduce a sparse coordinate-based loss, enabling the integration of LR images with arbitrary scaling. We evaluate our method on MR images from two independent cohorts. Our results demonstrate comparable or even improved super-resolution (SR) performance compared to state-of-the-art (SOTA) self-supervised SR methods for different upsampling scales. By combining a patient-agnostic offline and a patient-specific online phase, we achieve a substantial speed-up of up to ten times for patient-specific reconstruction while achieving similar or better SR quality. Code is available at https://github.com/MajaSchle/tripleSR.
中文摘要:本文提出一种自监督神经网络,通过融合两个正交低分辨率磁共振扫描来重建高分辨率解剖细节,在实现与先进方法相当或更优超分辨率性能的同时,将患者特异性重建速度最高提升十倍。
English Summary: This paper introduces a self-supervised neural network that fuses two orthogonal low-resolution MR scans to reconstruct high-resolution anatomical details, achieving comparable or superior super-resolution performance with up to ten times faster patient-specific reconstruction than state-of-the-art methods.

Authors:Shusen Ma, Tianhao Zhang, Qijiu Xia, Yun-Bo Zhao
Title: IBN: An Interpretable Bidirectional-Modeling Network for Multivariate Time Series Forecasting with Variable Missing
Abstract:
Multivariate time series forecasting (MTSF) often faces challenges from missing variables, which hinder conventional spatial-temporal graph neural networks in modeling inter-variable correlations. While GinAR addresses variable missing using attention-based imputation and adaptive graph learning for the first time, it lacks interpretability and fails to capture more latent temporal patterns due to its simple recursive units (RUs). To overcome these limitations, we propose the Interpretable Bidirectional-modeling Network (IBN), integrating Uncertainty-Aware Interpolation (UAI) and Gaussian kernel-based Graph Convolution (GGCN). IBN estimates the uncertainty of reconstructed values using MC Dropout and applies an uncertainty-weighted strategy to mitigate high-risk reconstructions. GGCN explicitly models spatial correlations among variables, while a bidirectional RU enhances temporal dependency modeling. Extensive experiments show that IBN achieves state-of-the-art forecasting performance under various missing-rate scenarios, providing a more reliable and interpretable framework for MTSF with missing variables. Code is available at: https://github.com/zhangth1211/NICLab-IBN.
中文: 提出的可解释双向建模网络(IBN)通过整合不确定性感知插值和高斯图卷积,解决了多元时间序列预测中变量缺失的问题,同时提升了模型可解释性并捕捉双向时间依赖关系。
English: The proposed Interpretable Bidirectional-modeling Network (IBN) overcomes limitations in multivariate time series forecasting by integrating uncertainty-aware interpolation and Gaussian graph convolutions to handle missing variables while improving interpretability and capturing bidirectional temporal patterns.

Authors:Chunhang Zheng, Zichang Ren, Dou Li
Title: SEEC: Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression
Abstract:
Recently, learned image compression has attracted considerable attention due to its superior performance over traditional methods. However, most existing approaches employ a single entropy model to estimate the probability distribution of pixel values across the entire image, which limits their ability to capture the diverse statistical characteristics of different semantic regions. To overcome this limitation, we propose Segmentation-Assisted Multi-Entropy Models for Lossless Image Compression (SEEC). Our framework utilizes semantic segmentation to guide the selection and adaptation of multiple entropy models, enabling more accurate probability distribution estimation for distinct semantic regions. Specifically, SEEC first extracts image features and then applies semantic segmentation to identify different regions, each assigned a specialized entropy model to better capture its unique statistical properties. Finally, a multi-channel discrete logistic mixture likelihood is employed to model the pixel value distributions effectively. Experimental results on benchmark datasets demonstrate that SEEC achieves state-of-the-art compression ratios while introducing only minimal encoding and decoding latency. With superior performance, the proposed model also supports Regions of Interest (ROIs) coding condition on the provided segmentation mask. Our code is available at https://github.com/chunbaobao/SEEC.
中文:提出的SEEC框架通过语义分割为不同图像区域应用多个专用熵模型,以最低延迟实现了最先进的无损图像压缩性能。
English: The proposed SEEC framework enhances lossless image compression by using semantic segmentation to apply multiple specialized entropy models for different image regions, achieving state-of-the-art compression ratios with minimal latency.

Authors:Xiaoming Chen
Title: HYLU: Hybrid Parallel Sparse LU Factorization
Abstract:
This article introduces HYLU, a hybrid parallel LU factorization-based general-purpose solver designed for efficiently solving sparse linear systems (Ax=b) on multi-core shared-memory architectures. The key technical feature of HYLU is the integration of hybrid numerical kernels so that it can adapt to various sparsity patterns of coefficient matrices. Tests on 34 sparse matrices from SuiteSparse Matrix Collection reveal that HYLU outperforms Intel MKL PARDISO in the numerical factorization phase by geometric means of 1.74X (for one-time solving) and 2.26X (for repeated solving). HYLU can be downloaded from https://github.com/chenxm1986/hylu.
中文:HYLU是一种混合并行LU分解通用求解器,通过集成混合数值内核适应系数矩阵的不同稀疏模式,在SuiteSparse测试中数值分解阶段性能以几何平均1.74倍(单次求解)和2.26倍(重复求解)超越Intel MKL PARDISO。
English: HYLU is a hybrid parallel LU factorization solver that efficiently handles sparse linear systems on multi-core architectures by adapting to diverse sparsity patterns, demonstrating performance gains of 1.74X to 2.26X over Intel MKL PARDISO in benchmarks.

Authors:Xixi Wu, Yanchao Tan, Nan Hou, Ruiyang Zhang, Hong Cheng
Title: MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval
Abstract:
Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which is essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-$K$ pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Codes and datasets are released at https://github.com/WxxShirley/MoLoRAG.
Chinese: MoLoRAG 是一种逻辑感知的检索框架,通过构建页面图结合语义和逻辑相关性,提升了多模态、多页文档的理解能力,在问答准确性和检索精度上显著优于现有方法。
English: MoLoRAG is a logic-aware retrieval framework that enhances multi-modal, multi-page document understanding by combining semantic and logical relevance through a page graph, significantly improving question-answering accuracy and retrieval precision over existing methods.

Authors:Sung Ju Lee, Nam Ik Cho
Title: Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
Abstract:
Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking due to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. Conclusively, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code available at https://github.com/thomas11809/SFWMark
中文摘要:提出的埃尔米特对称傅里叶水印(SFW)方法通过保持频率完整性和抗裁剪攻击的中心感知嵌入策略,显著提升了语义水印的鲁棒性,在各种攻击场景下实现了最优的检测性能与图像保真度。
English Summary: The proposed Hermitian Symmetric Fourier Watermarking (SFW) method with center-aware embedding enhances semantic watermarking by preserving frequency integrity and resisting cropping attacks, achieving state-of-the-art robustness and image fidelity across various attack scenarios.

Authors:Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu
Title: $ΔL$ Normalization: Rethink Loss Aggregation in RLVR
Abstract:
We propose $ΔL$ Normalization, a simple yet effective loss aggregation method tailored to the characteristic of dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed $ΔL$ Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
中文摘要:本文提出ΔL归一化方法,通过解决强化学习可验证奖励训练中响应长度变化导致的梯度方差问题,提供无偏估计并实现稳定优化,在多种实验设置下均取得优异性能。
English Summary: The paper introduces ΔL Normalization, an unbiased loss aggregation method that minimizes gradient variance in RLVR training by addressing variable response lengths, achieving superior performance across diverse settings.

Authors:Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang
Title: VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
Abstract:
With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that falicitate the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64\% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent's rationality, generalizability, and scalability. The codes, datasets and models are available at https://github.com/Wuzheng02/VeriOS.
Chinese: 为解决不可信环境中过度执行的风险,本研究提出可信操作系统代理VeriOS-Agent,该代理在正常条件下自主执行操作,在不确定场景中主动询问人类,在不影响正常性能的情况下将逐步成功率提升了20.64%。
English: To address the risks of over-execution in untrustworthy environments, this study introduces VeriOS-Agent, a trustworthy OS agent that autonomously handles normal conditions and proactively queries humans in uncertain scenarios, achieving a 20.64% improvement in step-wise success rate without compromising normal performance.

Authors:Peijin Xie, Shun Qian, Bingquan Liu, Dexin Wang, Lin Sun, Xiangzheng Zhang
Title: TextlessRAG: End-to-End Visual Document RAG by Speech Without Text
Abstract:
Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech--document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at repository:https://github.com/xiepeijinhit-hue/textlessrag
中文摘要:TextlessRAG是首个实现语音直接对文档图像进行问答的端到端框架,通过布局感知重排机制提升检索效果,在消除文本转换步骤的同时显著提高了效率与准确性。
English Summary: TextlessRAG is the first end-to-end framework that enables speech-based question answering directly on document images, eliminating traditional text conversion steps while improving efficiency and accuracy through layout-aware retrieval refinement.

Authors:Kiet T. Nguyen, Chanhuyk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong
Title: Universal Few-Shot Spatial Control for Diffusion Models
Abstract:
Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at https://github.com/kietngt00/UFC.
中文摘要:本文提出的通用少样本控制适配器(UFC)能让预训练扩散模型仅需少量标注数据即可快速适应新型空间控制任务,在多种架构上均达到与全监督方法相当的性能表现。
English Summary: The proposed Universal Few-Shot Control (UFC) adapter enables pretrained diffusion models to quickly adapt to new spatial control tasks using minimal training data, achieving performance comparable to fully supervised methods across various architectures.

Authors:Saad Lahlali, Alexandre Fournier Montgieux, Nicolas Granger, Hervé Le Borgne, Quoc Cuong Pham
Title: MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection
Abstract:
Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages temporal multi-view present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible. A Teacher-Student distillation paradigm is employed: The Teacher network learns from single viewpoints but targets are derived from temporally aggregated static objects. Then the Teacher generates high quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. % \footnote{Code available upon acceptance} Our code is available in our public repository (\href{https://github.com/CEA-LIST/MVAT}{code}).
中文: MVAT是一种新颖的框架,利用时序多视角数据和师生蒸馏范式,无需3D标注即可实现最先进的弱监督3D物体检测。
English: MVAT is a novel framework that utilizes temporal multi-view data and a Teacher-Student distillation paradigm to achieve state-of-the-art weakly supervised 3D object detection without requiring 3D annotations.

Authors:Jeongwoo Na, Jun Kwon, Eunseong Choi, Jongwuk Lee
Title: Multi-view-guided Passage Reranking with Large Language Models
Abstract:
Recent advances in large language models (LLMs) have shown impressive performance in passage reranking tasks. Despite their success, LLM-based methods still face challenges in efficiency and sensitivity to external biases. (1) Existing models rely mostly on autoregressive generation and sliding window strategies to rank passages, which incur heavy computational overhead as the number of passages increases. (2) External biases, such as position or selection bias, hinder the model's ability to accurately represent passages and increase input-order sensitivity. To address these limitations, we introduce a novel passage reranking model, called Multi-View-guided Passage Reranking (MVP). MVP is a non-generative LLM-based reranking method that encodes query-passage information into diverse view embeddings without being influenced by external biases. For each view, it combines query-aware passage embeddings to produce a distinct anchor vector, which is then used to directly compute relevance scores in a single decoding step. In addition, it employs an orthogonal loss to make the views more distinctive. Extensive experiments demonstrate that MVP, with just 220M parameters, matches the performance of much larger 7B-scale fine-tuned models while achieving a 100x reduction in inference latency. Notably, the 3B-parameter variant of MVP achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks. The source code is available at: https://github.com/bulbna/MVP
中文: 针对大型语言模型在段落重排任务中存在的效率低下和对外部偏差敏感的问题,我们提出了一种名为MVP的新型非生成式模型,通过将查询-段落信息编码为多样化视图嵌入并直接计算相关性得分,在显著降低推理延迟的同时实现了最先进的性能。
English: Recent advances in large language models (LLMs) have shown impressive performance in passage reranking but face challenges in efficiency and sensitivity to external biases, prompting the introduction of a novel non-generative model called MVP that encodes query-passage information into diverse view embeddings to compute relevance scores directly, achieving state-of-the-art performance with significantly reduced inference latency.

Authors:Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Title: MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification
Abstract:
Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch's diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance (AUROC 0.907 vs. 0.908) of EfficientNet-B0, while substantially improving interpretability: MedicalPatchNet demonstrates substantially improved interpretability with higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. Our model is publicly available with reproducible training and inference scripts and contributes to safer, explainable AI-assisted diagnostics across medical imaging domains. We make the code publicly available: https://github.com/TruhnLab/MedicalPatchNet
中文:MedicalPatchNet是一种自解释性胸部X光分类模型,在保持与EfficientNet-B0相当性能的同时,通过将决策透明归因于特定图像区域而显著提升可解释性,且无需后处理技术。
English: MedicalPatchNet is a self-explainable chest X-ray classification model that matches EfficientNet-B0's performance while significantly improving interpretability by transparently attributing decisions to specific image regions without post-hoc methods.

Authors:Pooya Khosravi, Kun Han, Anthony T. Wu, Arghavan Rezvani, Zexin Feng, Xiaohui Xie
Title: XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning
Abstract:
Optical Coherence Tomography Angiography (OCTA) and its derived en-face projections provide high-resolution visualization of the retinal and choroidal vasculature, which is critical for the rapid and accurate diagnosis of retinal diseases. However, acquiring high-quality OCTA images is challenging due to motion sensitivity and the high costs associated with software modifications for conventional OCT devices. Moreover, current deep learning methods for OCT-to-OCTA translation often overlook the vascular differences across retinal layers and struggle to reconstruct the intricate, dense vascular details necessary for reliable diagnosis. To overcome these limitations, we propose XOCT, a novel deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware vascular reconstruction. Our CDS module leverages 2D layer-wise en-face projections, generated via segmentation-weighted z-axis averaging, as supervisory signals to compel the network to learn distinct representations for each retinal layer through fine-grained, targeted guidance. Meanwhile, the MSFF module enhances vessel delineation through multi-scale feature extraction combined with a channel reweighting strategy, effectively capturing vascular details at multiple spatial scales. Our experiments on the OCTA-500 dataset demonstrate XOCT's improvements, especially for the en-face projections which are significant for clinical evaluation of retinal pathologies, underscoring its potential to enhance OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring. The code is available at https://github.com/uci-cbcl/XOCT.
中文: 提出的XOCT深度学习框架结合跨维度监督和多尺度特征融合,优化了OCTA成像中分层感知的血管重建,提升了视网膜疾病诊断的可靠性与应用普及性。
English: The proposed XOCT deep learning framework integrates cross-dimensional supervision and multi-scale feature fusion to enhance layer-aware vascular reconstruction in OCTA imaging, improving diagnostic reliability and accessibility for retinal diseases.

Authors:Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Xue Yang, Hongsheng Li
Title: GLEAM: Learning to Match and Explain in Cross-View Geo-Localization
Abstract:
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they merely predict whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities-including UAV imagery, street maps, panoramic views, and ground photographs-by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.
中文: GLEAM-C是一个基础性跨视角地理定位模型,通过将无人机图像、街景地图、全景视图和地面照片等多模态数据与卫星图像对齐,实现了与现有方法相当的效率与精度;而GLEAM-X则利用多模态大语言模型引入可解释推理机制,解决了传统地理定位方法缺乏解释性的问题。
English: GLEAM-C is a foundational cross-view geo-localization model that unifies multiple views and modalities by aligning them with satellite imagery, achieving efficiency and accuracy comparable to prior methods, while GLEAM-X introduces explainable reasoning through multimodal large language models to address interpretability in geo-localization.

Authors:Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Peifeng Ma, Xue Yang, Hongsheng Li
Title: GLEAM: Learning to Match and Explain in Cross-View Geo-Localization
Abstract:
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they only determine whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities-including UAV imagery, street maps, panoramic views, and ground photographs-by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.
中文: GLEAM-C是一个基础性跨视角地理定位模型,通过将无人机图像、街景地图、全景视图和地面照片等多模态数据与卫星图像对齐,实现了与现有方法相当的效率与精度;而GLEAM-X则利用多模态大语言模型引入可解释推理机制,解决了传统地理定位方法缺乏解释性的问题。
English: GLEAM-C is a foundational cross-view geo-localization model that unifies multiple views and modalities by aligning them with satellite imagery, achieving efficiency and accuracy comparable to prior methods, while GLEAM-X introduces explainable reasoning through multimodal large language models to address interpretability in geo-localization.

Authors:Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Chenyang Zhao, Zhongwei Wan, Chaofan Tao, Wendong Xu, Hui Shen, Chengming Li, Lingpeng Kong, Ngai Wong
Title: LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction
Abstract:
Large language models (LLMs) make significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for Emotion Expression. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical and real-world EI applications. Furthermore, we conducted a comparative case study experiment on the GPT series to demonstrate the differences among various models in terms of EI. Code is available on GitHub at https://github.com/LongEmotion/LongEmotion, and the project page can be found at https://longemotion.github.io/.
中文: LongEmotion基准通过引入多样化任务和长输入,弥补了长上下文情感智能评估的不足,并证明检索增强生成与协作情感建模方法能有效提升实际场景下的性能表现。
English: The LongEmotion benchmark addresses gaps in evaluating emotional intelligence in long-context scenarios by introducing diverse tasks with extended inputs and demonstrating that Retrieval-Augmented Generation and Collaborative Emotional Modeling methods significantly enhance performance under realistic constraints.

Authors:Yi-Jie Cheng, Oscar Chew, Yun-Nung Chen
Title: The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering
Abstract:
Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising approach to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, limiting accessibility and scalability. In this study, we investigate the capabilities of existing integration methods for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose leveraging simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experiment results demonstrate that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: https://github.com/yijie-cheng/SLM-ToG/.
Chinese: 本研究提出采用轻量级探索模块处理知识图谱遍历任务,有效提升了小型语言模型在知识图谱问答中的性能表现,克服了其固有推理能力限制。
English: This study proposes using lightweight exploration modules to enhance small language models' performance in knowledge graph question answering by handling graph traversal, overcoming their inherent limitations in reasoning over complex data structures.

Authors:Ze Sheng, Qingxiao Xu, Jianwei Huang, Matthew Woodcock, Heqing Huang, Alastair F. Donaldson, Guofei Gu, Jeff Huang
Title: All You Need Is A Fuzzing Brain: An LLM-Powered System for Automated Vulnerability Detection and Patching
Abstract:
Our team, All You Need Is A Fuzzing Brain, was one of seven finalists in DARPA's Artificial Intelligence Cyber Challenge (AIxCC), placing fourth in the final round. During the competition, we developed a Cyber Reasoning System (CRS) that autonomously discovered 28 security vulnerabilities - including six previously unknown zero-days - in real-world open-source C and Java projects, and successfully patched 14 of them. The complete CRS is open source at https://github.com/o2lab/afc-crs-all-you-need-is-a-fuzzing-brain. This paper provides a detailed technical description of our CRS, with an emphasis on its LLM-powered components and strategies. Building on AIxCC, we further introduce a public leaderboard for benchmarking state-of-the-art LLMs on vulnerability detection and patching tasks, derived from the AIxCC dataset. The leaderboard is available at https://o2lab.github.io/FuzzingBrain-Leaderboard/.
中文: 我们的团队在DARPA的AIxCC竞赛中获得第四名,开发的开源网络推理系统自主发现了28个安全漏洞(含6个零日漏洞)并修复了其中14个,同时建立了用于评估大语言模型安全任务性能的公开排行榜。
English: Our team placed fourth in DARPA's AIxCC competition by developing an open-source Cyber Reasoning System that autonomously discovered 28 vulnerabilities—including six zero-days—and patched 14 of them, while also establishing a public leaderboard for benchmarking LLMs on security tasks.

Authors:Erencem Ozbey, Dimitrios I. Diochnos
Title: Dimensionally Reduced Open-World Clustering: DROWCULA
Abstract:
Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications make things more complicated because no matter how many labels may have been identified in a task of interest, it could be the case that examples corresponding to novel classes may appear in the future. Not unsurprisingly, prior work in this, so-called, `open-world' context has focused a lot on semi-supervised approaches. Focusing on image classification, somehow paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new State-of-the-Art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so, both when the number of clusters is known or unknown ahead of time. The code is available at: https://github.com/DROWCULA/DROWCULA.
中文: 本文提出了一种完全无监督的方法,通过结合视觉变换器进行聚类和流形学习优化嵌入,在多个数据集上实现了最先进的图像分类新类别发现效果,无论聚类数量是否已知。
English: This paper introduces a fully unsupervised method for identifying novel classes in image classification by combining Vision Transformers for clustering with manifold learning to enhance embeddings, achieving state-of-the-art results across multiple datasets even when the cluster count is unknown.

Authors:Heng Hao, Wenjun Hu, Oxana Verkholyak, Davoud Ataee Tarzanagh, Baruch Gutow, Sima Didari, Masoud Faraki, Hankyu Moon, Seungjai Min
Title: PaVeRL-SQL: Text-to-SQL via Partial-Match Rewards and Verbal Reinforcement Learning
Abstract:
Text-to-SQL models allow users to interact with a database more easily by generating executable SQL statements from natural-language questions. Despite recent successes on simpler databases and questions, current Text-to-SQL methods still suffer from low execution accuracy on industry-scale databases and complex questions involving domain-specific business logic. We present \emph{PaVeRL-SQL}, a framework that combines \emph{Partial-Match Rewards} and \emph{Verbal Reinforcement Learning} to drive self-improvement in reasoning language models (RLMs) for Text-to-SQL. To handle practical use cases, we adopt two pipelines: (1) a newly designed in-context learning framework with group self-evaluation (verbal-RL), using capable open- and closed-source large language models (LLMs) as backbones; and (2) a chain-of-thought (CoT) RL pipeline with a small backbone model (OmniSQL-7B) trained with a specially designed reward function and two-stage RL. These pipelines achieve state-of-the-art (SOTA) results on popular Text-to-SQL benchmarks -- Spider, Spider 2.0, and BIRD. For the industrial-level Spider2.0-SQLite benchmark, the verbal-RL pipeline achieves an execution accuracy 7.4\% higher than SOTA, and the CoT pipeline is 1.4\% higher. RL training with mixed SQL dialects yields strong, threefold gains, particularly for dialects with limited training data. Overall, \emph{PaVeRL-SQL} delivers reliable, SOTA Text-to-SQL under realistic industrial constraints. The code is available at https://github.com/PaVeRL-SQL/PaVeRL-SQL.
中文:PaVeRL-SQL框架通过结合部分匹配奖励和语言强化学习,有效提升了工业级复杂数据库的Text-to-SQL性能,在主流基准测试中取得最优结果。
English: The PaVeRL-SQL framework enhances Text-to-SQL performance for complex industrial databases by integrating partial-match rewards and verbal reinforcement learning, achieving state-of-the-art results on major benchmarks.

Authors:Zhiyin Tan, Jennifer D'Souza
Title: Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models
Abstract:
This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.
中文摘要:本研究提出了一种基于大语言模型的主题模型自动评估框架,通过可解释的多维度指标弥补传统方法的不足,有效识别主题冗余和语义偏移等关键缺陷。
English Summary: This study introduces an LLM-based framework for automated topic model evaluation, addressing the limitations of traditional metrics by providing interpretable, multi-dimensional assessments that reveal critical weaknesses like redundancy and semantic drift.

Authors:Ziheng Chen, Xiao-Jun Wu, Bernhard Schölkopf, Nicu Sebe
Title: Riemannian Batch Normalization: A Gyro Approach
Abstract:
Normalization layers are crucial for deep learning, but their Euclidean formulations are inadequate for data on manifolds. On the other hand, many Riemannian manifolds in machine learning admit gyro-structures, enabling principled extensions of Euclidean neural networks to non-Euclidean domains. Inspired by this, we introduce GyroBN, a principled Riemannian batch normalization framework for gyrogroups. We establish two necessary conditions, namely \emph{pseudo-reduction} and \emph{gyroisometric gyrations}, that guarantee GyroBN with theoretical control over sample statistics, and show that these conditions hold for all known gyrogroups in machine learning. Our framework also incorporates several existing Riemannian normalization methods as special cases. We further instantiate GyroBN on seven representative geometries, including the Grassmannian, five constant curvature spaces, and the correlation manifold, and derive novel gyro and Riemannian structures to enable these instantiations. Experiments across these geometries demonstrate the effectiveness of GyroBN. The code is available at https://github.com/GitZH-Chen/GyroBN.git.
Chinese: GyroBN是一种基于陀螺群的黎曼批量归一化框架,可将神经网络扩展至非欧几里得空间,具备理论保证并在多种几何结构上验证了有效性。
English: GyroBN is a principled Riemannian batch normalization framework for gyrogroups that extends neural networks to non-Euclidean domains, with theoretical guarantees and experimental validation across multiple geometries.

Authors:Sergey Pozdnyakov, Philippe Schwaller
Title: Lookup multivariate Kolmogorov-Arnold Networks
Abstract:
High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce a general drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0x while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10x higher H100 throughput at equal accuracy. Within frameworks of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6-2.1x and by 1.7x on the CIFAR-10 and ImageNet-1k datasets, respectively. Our code, including dedicated CUDA kernels, is available online at https://github.com/schwallergroup/lmkan.
中文:提出的查找多元柯尔莫哥洛夫-阿诺德网络(lmKANs)通过显著降低计算成本,同时在多个基准测试中保持或提升模型性能,为传统线性层提供了更优的替代方案。
English: The proposed lookup multivariate Kolmogorov-Arnold Networks (lmKANs) provide a superior alternative to traditional linear layers by significantly reducing computational costs while maintaining or enhancing model performance across various benchmarks.

Authors:Kapil Madan
Title: ArGen: Auto-Regulation of Generative AI via GRPO and Policy-as-Code
Abstract:
This paper introduces ArGen (Auto-Regulation of Generative AI systems), a framework for aligning Large Language Models (LLMs) with complex sets of configurable, machine-readable rules spanning ethical principles, operational safety protocols, and regulatory compliance standards. Moving beyond just preference-based alignment, ArGen is designed to ensure LLMs adhere to these multifaceted policies through a novel synthesis of principle-based automated reward scoring, Group Relative Policy Optimisation (GRPO), and an Open Policy Agent (OPA) inspired governance layer. This approach provides the technical foundation for achieving and demonstrating compliance with diverse and nuanced governance requirements. To showcase the framework's capability to operationalize a deeply nuanced and culturally-specific value system, we present an in-depth case study: the development of a medical AI assistant guided by principles from Dharmic ethics (such as Ahimsa and Dharma), as derived from texts like the Bhagavad Gita. This challenging application demonstrates ArGen's adaptability, achieving a 70.9% improvement in domain-scope adherence over the baseline. Through our open-source repository, we show that ArGen's methodology offers a path to 'Governable Al' systems that are technically proficient, ethically robust, and verifiably compliant for safe deployment in diverse global contexts.
中文: ArGen框架通过自动奖励评分、GRPO和治理层,使大型语言模型遵循复杂可配置的伦理、安全和法规规则,并以基于达摩伦理的医疗AI案例展示了70.9%的领域依从性提升。
English: ArGen is a framework that aligns Large Language Models with complex, configurable rules for ethical, safety, and regulatory compliance through automated reward scoring, GRPO, and a governance layer, demonstrating a 70.9% improvement in adherence via a case study on a medical AI guided by Dharmic ethics.

Authors:Yingsheng Wang, Shuo Lu, Jian Liang, Aihua Zheng, Ran He
Title: Frustratingly Easy Feature Reconstruction for Out-of-Distribution Detection
Abstract:
Out-of-distribution (OOD) detection helps models identify data outside the training categories, crucial for security applications. While feature-based post-hoc methods address this by evaluating data differences in the feature space without changing network parameters, they often require access to training data, which may not be suitable for some data privacy scenarios. This may not be suitable in scenarios where data privacy protection is a concern. In this paper, we propose a simple yet effective post-hoc method, termed Classifier-based Feature Reconstruction (ClaFR), from the perspective of subspace projection. It first performs an orthogonal decomposition of the classifier's weights to extract the class-known subspace, then maps the original data features into this subspace to obtain new data representations. Subsequently, the OOD score is determined by calculating the feature reconstruction error of the data within the subspace. Compared to existing OOD detection algorithms, our method does not require access to training data while achieving leading performance on multiple OOD benchmarks. Our code is released at https://github.com/Aie0923/ClaFR.
Chinese: 提出的基于分类器的特征重构方法通过子空间投影和特征重构误差,在不访问训练数据的情况下实现分布外检测,既保护了数据隐私又达到了领先性能。
English: The proposed Classifier-based Feature Reconstruction (ClaFR) method enables out-of-distribution detection without accessing training data by utilizing subspace projection and feature reconstruction error, achieving state-of-the-art performance while addressing privacy concerns.

Authors:Cedric Caruzzo, Jong Chul Ye
Title: CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis
Abstract:
Large-scale biological discovery requires integrating massive, heterogeneous datasets like those from the JUMP Cell Painting consortium, but technical batch effects and a lack of generalizable models remain critical roadblocks. To address this, we introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology that are robust to batch effects. Unlike traditional methods that require retraining on new data, CellPainTR's design, featuring source-specific context tokens, allows for effective out-of-distribution (OOD) generalization to entirely unseen datasets without fine-tuning. We validate CellPainTR on the large-scale JUMP dataset, where it outperforms established methods like ComBat and Harmony in both batch integration and biological signal preservation. Critically, we demonstrate its robustness through a challenging OOD task on the unseen Bray et al. dataset, where it maintains high performance despite significant domain and feature shifts. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
中文摘要:为解决批次效应问题,我们开发了基于Transformer的CellPainTR模型,它能学习通用的细胞形态表征,无需重新训练即可实现优异的批次整合和跨数据集泛化能力。
English Summary: To overcome batch effects and enable robust biological discovery, we developed CellPainTR, a Transformer model that learns generalized cellular morphology representations, achieving superior batch integration and out-of-distribution generalization without retraining.

Authors:Jiajun Chai, Guojun Yin, Zekun Xu, Chuhuai Yue, Yi Jia, Siyu Xia, Xiaohan Wang, Jiwen Jiang, Xiaoguang Li, Chengqi Dong, Hang He, Wei Lin
Title: RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use
Abstract:
Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools. We present RLFactory, a plug-and-play reinforcement learning post-training framework for multi-round tool use. RLFactory tackles (i) tool-call stability and adaptability amid tool heterogeneity and interface issues via an asyncio-based asynchronous caller and a decoupled tool/training architecture, and (ii) diverse evaluation needs via a reward layer supporting rule-based, model-judgment, and tool-verification signals. It reconstructs the MDP by introducing observation markers from tool feedback, closing the loop among model, tools, and environment, and implements a generate-parse-invoke-update workflow for dynamic policy optimization. On Search-R1 with Qwen3-4B, RLFactory achieves a 0.486 test score on the Natural Questions (NQ) dataset, surpassing larger models trained with similar techniques (e.g., Qwen2.5-7B-Instruct-GRPO at 0.473), and increases training throughput by 6.8x. RLFactory provides a low-barrier, highly adaptable framework for strengthening multi-round tool use of LLMs in real-world scenarios. Code: https://github.com/Simple-Efficient/RL-Factory.
中文:RLFactory是一个即插即用的强化学习框架,通过异步调用器和解耦架构提升大语言模型在多轮工具使用中的稳定性和适应性,并利用灵活奖励层支持多样化评估,在基准测试中实现了更优的性能和效率。
English: RLFactory is a plug-and-play reinforcement learning framework that enhances large language models' multi-round tool use by improving tool-call stability and adaptability through asynchronous calling and a decoupled architecture, while supporting diverse evaluations with a flexible reward layer, achieving superior performance and efficiency on benchmark tests.

Authors:Zehua Li
Title: Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification
Abstract:
This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems deployed on CPU, GPU, and compiled runtimes. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol covering tensor-level closeness, activation alignment, and task-level metrics. Through 672 checks across multiple models and tolerance settings, we observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Our results show that detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. We further demonstrate that deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss. To our knowledge, this is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.
中文: 本文提出了一种配置优先的框架,系统性地评估并缓解深度学习系统中的跨后端兼容性问题,采用三层验证协议,并证明确定性适配器能显著提高不同运行时环境间的一致性。
English: This paper introduces a configuration-first framework that systematically evaluates and mitigates cross-backend compatibility issues in deep learning systems, employing a three-tier verification protocol and demonstrating that deterministic adapters can significantly improve agreement across diverse runtimes.

Authors:Yu Song, Zhigang Hua, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu
Title: GSTBench: A Benchmark Study on the Transferability of Graph Self-Supervised Learning
Abstract:
Self-supervised learning (SSL) has shown great promise in graph representation learning. However, most existing graph SSL methods are developed and evaluated under a single-dataset setting, leaving their cross-dataset transferability largely unexplored and limiting their ability to leverage knowledge transfer and large-scale pretraining, factors that are critical for developing generalized intelligence beyond fitting training data. To address this gap and advance foundation model research for graphs, we present GSTBench, the first systematic benchmark for evaluating the transferability of graph SSL methods. We conduct large-scale pretraining on ogbn-papers100M and evaluate five representative SSL methods across a diverse set of target graphs. Our standardized experimental setup decouples confounding factors such as model architecture, dataset characteristics, and adaptation protocols, enabling rigorous comparisons focused solely on pretraining objectives. Surprisingly, we observe that most graph SSL methods struggle to generalize, with some performing worse than random initialization. In contrast, GraphMAE, a masked autoencoder approach, consistently improves transfer performance. We analyze the underlying factors that drive these differences and offer insights to guide future research on transferable graph SSL, laying a solid foundation for the "pretrain-then-transfer" paradigm in graph learning. Our code is available at https://github.com/SongYYYY/GSTBench.
中文: GSTBench是首个评估图自监督学习方法可迁移性的基准,发现除GraphMAE外多数方法难以泛化,其持续提升性能的表现为未来研究提供了重要洞见。
English: GSTBench is the first benchmark for evaluating the transferability of graph self-supervised learning methods, revealing that most struggle to generalize except for GraphMAE, which consistently improves performance and provides insights for future research.

Authors:Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, Nicu Sebe
Title: H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers
Abstract:
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT), for efficient transformer-based 3D human pose estimation from videos. H$_{2}$OT begins with progressively pruning pose tokens of redundant frames and ends with recovering full-length sequences, resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. It works with two key modules, namely, a Token Pruning Module (TPM) and a Token Recovering Module (TRM). TPM dynamically selects a few representative tokens to eliminate the redundancy of video frames, while TRM restores the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines while effectively accommodating different token pruning and recovery strategies. In addition, our H$_{2}$OT reveals that maintaining the full pose sequence is unnecessary, and a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy. Extensive experiments on multiple benchmark datasets demonstrate both the effectiveness and efficiency of the proposed method. Code and models are available at https://github.com/NationalGAILab/HoT.
中文: 本文提出的H₂OT分层即插即用框架通过剪枝冗余姿态令牌并恢复完整序列,显著提升了基于视频的3D人体姿态估计效率,在降低计算成本的同时保持高精度。
English: This paper introduces H₂OT, a hierarchical plug-and-play framework that enhances the efficiency of video-based 3D human pose estimation by pruning redundant pose tokens and recovering full sequences, achieving high performance with reduced computational costs.

Authors:Yinjie Wang, Ling Yang, Bowen Li, Ye Tian, Ke Shen, Mengdi Wang
Title: Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
Abstract:
We propose TraceRL, a trajectory-aware reinforcement learning framework for diffusion language models (DLMs) that incorporates preferred inference trajectory into post-training, and is applicable across different architectures. Equipped with a diffusion-based value model that enhances training stability, we demonstrate improved reasoning performance on complex math and coding tasks. Besides, it can also be applied to adapt block-specific models to larger blocks, which improves sampling flexibility. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than 7B-scale AR models, TraDo-4B-Instruct still consistently outperforms them across complex math reasoning tasks. TraDo-8B-Instruct achieves relative accuracy improvements of 6.1% over Qwen2.5-7B-Instruct and 51.3% over Llama3.1-8B-Instruct on mathematical reasoning benchmarks. Through curriculum learning, we also derive the first long-CoT DLM, outperforming Qwen2.5-7B-Instruct on MATH500 with an 18.1% relative accuracy gain. To facilitate reproducible research and practical applications, we release a comprehensive open-source framework for building, training, and deploying diffusion LLMs across diverse architectures. The framework integrates accelerated KV-cache techniques and inference engines for both inference and reinforcement learning, and includes implementations of various supervised fine-tuning and RL methods for mathematics, coding, and general tasks. Code and Models: https://github.com/Gen-Verse/dLLM-RL
中文: TraceRL是一种面向扩散语言模型的轨迹感知强化学习框架,通过整合偏好推理轨迹提升复杂任务中的推理性能,并实现了跨架构的灵活模型适配,其TraDo模型在数学推理上显著超越同类模型。
English: TraceRL is a trajectory-aware reinforcement learning framework for diffusion language models that enhances reasoning performance on complex tasks and enables flexible model adaptation, achieving state-of-the-art results with its TraDo models.

Authors:Wenxuan Huang, Shuang Chen, Zheyong Xie, Shaosheng Cao, Shixiang Tang, Yufan Shen, Qingyu Yin, Wenbo Hu, Xiaoman Wang, Yuntian Tang, Junbo Qiao, Yue Guo, Yao Hu, Zhenfei Yin, Philip Torr, Yu Cheng, Wanli Ouyang, Shaohui Lin
Title: Interleaving Reasoning for Better Text-to-Image Generation
Abstract:
Unified multimodal understanding and generation models recently have achieve significant improvement in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces a text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking, and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline in the full thinking-image trajectory data. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights and datasets will be released in: https://github.com/Osilly/Interleaving-Reasoning-Generation .
Chinese: 提出的交错推理生成(IRG)框架通过交替进行文本推理与图像合成,有效提升了文本到图像生成中的细节保持与指令遵循能力,并采用两阶段训练方法在多个基准测试中实现了最先进的性能。
English: The proposed Interleaving Reasoning Generation (IRG) framework alternates between text-based reasoning and image synthesis to enhance detail preservation and instruction following in text-to-image generation, achieving state-of-the-art performance across multiple benchmarks through a two-stage training approach.

Authors:Bing Han, Chen Zhu, Dong Han, Rui Yu, Songliang Cao, Jianhui Wu, Scott Chapman, Zijian Wang, Bangyou Zheng, Wei Guo, Marie Weiss, Benoit de Solan, Andreas Hund, Lukas Roth, Kirchgessner Norbert, Andrea Visioni, Yufeng Ge, Wenjuan Li, Alexis Comar, Dong Jiang, Dejun Han, Fred Baret, Yanfeng Ding, Hao Lu, Shouyang Liu
Title: FoMo4Wheat: Toward reliable crop vision foundation models with globally curated data
Abstract:
Vision-driven field monitoring is central to digital agriculture, yet models built on general-domain pretrained backbones often fail to generalize across tasks, owing to the interaction of fine, variable canopy structures with fluctuating field conditions. We present FoMo4Wheat, one of the first crop-domain vision foundation model pretrained with self-supervision on ImAg4Wheat, the largest and most diverse wheat image dataset to date (2.5 million high-resolution images collected over a decade at 30 global sites, spanning >2,000 genotypes and >500 environmental conditions). This wheat-specific pretraining yields representations that are robust for wheat and transferable to other crops and weeds. Across ten in-field vision tasks at canopy and organ levels, FoMo4Wheat models consistently outperform state-of-the-art models pretrained on general-domain dataset. These results demonstrate the value of crop-specific foundation models for reliable in-field perception and chart a path toward a universal crop foundation model with cross-species and cross-task capabilities. FoMo4Wheat models and the ImAg4Wheat dataset are publicly available online: https://github.com/PheniX-Lab/FoMo4Wheat and https://huggingface.co/PheniX-Lab/FoMo4Wheat. The demonstration website is: https://fomo4wheat.phenix-lab.com/.
中文: FoMo4Wheat是首个基于大规模小麦图像数据集ImAg4Wheat预训练的作物领域视觉基础模型,在十项田间任务中表现卓越,不仅对小麦具有强鲁棒性,还能迁移应用于其他作物。
English: FoMo4Wheat is a pioneering crop-domain vision foundation model pretrained on the extensive ImAg4Wheat dataset, demonstrating superior performance across ten in-field tasks and robustness for wheat applications while being transferable to other crops.

Authors:Morteza Kiani Haftlang, Mohammadhossein Malmir, Foroutan Parand, Umberto Michelucci, Safouane El Ghazouali
Title: Barlow-Swin: Toward a novel siamese-based segmentation architecture using Swin-Transformers
Abstract:
Medical image segmentation is a critical task in clinical workflows, particularly for the detection and delineation of pathological regions. While convolutional architectures like U-Net have become standard for such tasks, their limited receptive field restricts global context modeling. Recent efforts integrating transformers have addressed this, but often result in deep, computationally expensive models unsuitable for real-time use. In this work, we present a novel end-to-end lightweight architecture designed specifically for real-time binary medical image segmentation. Our model combines a Swin Transformer-like encoder with a U-Net-like decoder, connected via skip pathways to preserve spatial detail while capturing contextual information. Unlike existing designs such as Swin Transformer or U-Net, our architecture is significantly shallower and competitively efficient. To improve the encoder's ability to learn meaningful features without relying on large amounts of labeled data, we first train it using Barlow Twins, a self-supervised learning method that helps the model focus on important patterns by reducing unnecessary repetition in the learned features. After this pretraining, we fine-tune the entire model for our specific task. Experiments on benchmark binary segmentation tasks demonstrate that our model achieves competitive accuracy with substantially reduced parameter count and faster inference, positioning it as a practical alternative for deployment in real-time and resource-limited clinical environments. The code for our method is available at Github repository: https://github.com/mkianih/Barlow-Swin.
中文: 本文提出了一种轻量级实时医学图像二值分割模型,结合类Swin Transformer编码器与U-Net解码器,通过自监督预训练实现参数量更少、推理更快且精度相当的性能。
English: This paper introduces a lightweight, real-time binary medical image segmentation model that combines a Swin Transformer-like encoder with a U-Net decoder, using self-supervised pretraining to achieve competitive accuracy with fewer parameters and faster inference.

Authors:James Xu Zhao, Bryan Hooi, See-Kiong Ng
Title: Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet
Abstract:
Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, it even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that compared to non-thinking, enabling thinking remains beneficial. Code and data are available at https://github.com/XuZhao0/tts-knowledge
中文: 测试时扩展虽能增强推理计算,但在知识密集型任务中效果不佳,不仅无法持续提升准确性,反而常增加幻觉,因为模型可能选择弃答或产生确认偏误,而非改善事实回忆。
English: Test-time scaling enhances inference computation but proves ineffective for knowledge-intensive tasks, often increasing hallucinations without consistently improving accuracy, as it may lead to abstention or confirmation bias rather than better factual recall.

Authors:Matteo Muratori, Joël Seytre
Title: ToonOut: Fine-tuned Background-Removal for Anime Characters
Abstract:
While state-of-the-art background removal models excel at realistic imagery, they frequently underperform in specialized domains such as anime-style content, where complex features like hair and transparency present unique challenges. To address this limitation, we collected and annotated a custom dataset of 1,228 high-quality anime images of characters and objects, and fine-tuned the open-sourced BiRefNet model on this dataset. This resulted in marked improvements in background removal accuracy for anime-style images, increasing from 95.3% to 99.5% for our newly introduced Pixel Accuracy metric. We are open-sourcing the code, the fine-tuned model weights, as well as the dataset at: https://github.com/MatteoKartoon/BiRefNet.
Chinese: 本研究针对背景移除模型在动漫风格内容中的表现不佳问题,通过在1,228张标注动漫图像数据集上微调BiRefNet模型,将像素精度从95.3%显著提升至99.5%,并公开了所有相关资源。
English: The study addresses the underperformance of background removal models in anime-style content by fine-tuning the BiRefNet model on a custom dataset of 1,228 annotated images, significantly improving accuracy from 95.3% to 99.5% and releasing all resources publicly.

Authors:Mohammad Reza Mirbagheri, Mohammad Mahdi Mirkamali, Zahra Motoshaker Arani, Ali Javeri, Amir Mahdi Sadeghzadeh, Rasool Jalili
Title: EPT Benchmark: Evaluation of Persian Trustworthiness in Large Language Models
Abstract:
Large Language Models (LLMs), trained on extensive datasets using advanced deep learning architectures, have demonstrated remarkable performance across a wide range of language tasks, becoming a cornerstone of modern AI technologies. However, ensuring their trustworthiness remains a critical challenge, as reliability is essential not only for accurate performance but also for upholding ethical, cultural, and social values. Careful alignment of training data and culturally grounded evaluation criteria are vital for developing responsible AI systems. In this study, we introduce the EPT (Evaluation of Persian Trustworthiness) metric, a culturally informed benchmark specifically designed to assess the trustworthiness of LLMs across six key aspects: truthfulness, safety, fairness, robustness, privacy, and ethical alignment. We curated a labeled dataset and evaluated the performance of several leading models - including ChatGPT, Claude, DeepSeek, Gemini, Grok, LLaMA, Mistral, and Qwen - using both automated LLM-based and human assessments. Our results reveal significant deficiencies in the safety dimension, underscoring the urgent need for focused attention on this critical aspect of model behavior. Furthermore, our findings offer valuable insights into the alignment of these models with Persian ethical-cultural values and highlight critical gaps and opportunities for advancing trustworthy and culturally responsible AI. The dataset is publicly available at: https://github.com/Rezamirbagheri110/EPT-Benchmark.
Chinese: 本研究引入EPT指标,这是一个基于文化背景的基准,用于评估大型语言模型在六个关键方面的可信度,揭示了显著的安全缺陷,并强调了在AI开发中与波斯伦理文化价值观保持一致的必要性。
English: This study introduces the EPT metric, a culturally informed benchmark for evaluating the trustworthiness of LLMs across six key aspects, revealing significant safety deficiencies and highlighting the need for alignment with Persian ethical-cultural values in AI development.

Authors:Simon Pezold, Jérôme A. Kurylec, Jan S. Liechti, Beat P. Müller, Joël L. Lavanchy
Title: Leveraging Generic Foundation Models for Multimodal Surgical Data Analysis
Abstract:
We investigate how both the adaptation of a generic foundation model via transfer learning and the integration of complementary modalities from the operating room (OR) can support surgical data science. To this end, we use V-JEPA as the single-modality foundation of a multimodal model for minimally invasive surgery support. We analyze how the model's downstream performance can benefit (a) from finetuning on unlabeled surgical video data and (b) from providing additional time-resolved data streams from the OR in a multimodal setup. In an in-house dataset of liver surgery videos, we analyze the tasks of predicting hospital length of stay and postoperative complications. In videos of the public HeiCo dataset, we analyze the task of surgical phase recognition. As a baseline, we apply pretrained V-JEPA to all tasks. We then finetune it on unlabeled, held-out videos to investigate its change in performance after domain adaptation. Following the idea of modular decision support networks, we integrate additional data streams from the OR by training a separate encoder to form a shared representation space with V-JEPA's embeddings. Our experiments show that finetuning on domain-specific data increases model performance. On the in-house data, integrating additional time-resolved data likewise benefits the model. On the HeiCo data, accuracy of the pretrained video-only, single-modality baseline setup is on par with the top-performing submissions of the EndoVis2017 challenge, while finetuning on domain-specific data increases accuracy further. Our results thus demonstrate how surgical data science can leverage public, generic foundation models. Likewise, they indicate the potential of domain adaptation and of integrating suitable complementary data streams from the OR. To support further research, we release our code and model weights at https://github.com/DigitalSurgeryLab-Basel/ML-CDS-2025.
中文: 本研究证明,通过在未标记手术视频上微调V-JEPA基础模型并整合多模态手术室数据流,能显著提升住院时长预测和手术阶段识别等外科数据科学任务的性能。
English: This study demonstrates that finetuning the V-JEPA foundation model on unlabeled surgical videos and integrating multimodal OR data streams significantly enhances performance in surgical data science tasks like length-of-stay prediction and phase recognition.

Authors:Yufeng Cheng, Wenxu Wu, Shaojin Wu, Mengqi Huang, Fei Ding, Qian He
Title: UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
Abstract:
Recent advancements in image customization exhibit a wide range of application prospects due to stronger customization capabilities. However, since we humans are more sensitive to faces, a significant challenge remains in preserving consistent identity while avoiding identity confusion with multi-reference images, limiting the identity scalability of customization models. To address this, we present UMO, a Unified Multi-identity Optimization framework, designed to maintain high-fidelity identity preservation and alleviate identity confusion with scalability. With "multi-to-multi matching" paradigm, UMO reformulates multi-identity generation as a global assignment optimization problem and unleashes multi-identity consistency for existing image customization methods generally through reinforcement learning on diffusion models. To facilitate the training of UMO, we develop a scalable customization dataset with multi-reference images, consisting of both synthesised and real parts. Additionally, we propose a new metric to measure identity confusion. Extensive experiments demonstrate that UMO not only improves identity consistency significantly, but also reduces identity confusion on several image customization methods, setting a new state-of-the-art among open-source methods along the dimension of identity preserving. Code and model: https://github.com/bytedance/UMO
中文摘要:UMO框架通过全局分配优化和强化学习,提升了图像定制中的多身份一致性,有效减少身份混淆,并在多个定制方法中实现了最先进的身份保持效果。
English Summary: The UMO framework enhances image customization by optimizing multi-identity preservation through a global assignment approach and reinforcement learning, significantly reducing identity confusion while maintaining high fidelity across diverse reference images.

Authors:Max Malyi, Jonathan Shek, Alasdair McDonald, Andre Biscaya
Title: A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs
Abstract:
Effective Operation and Maintenance (O&M) is critical to reducing the Levelised Cost of Energy (LCOE) from wind power, yet the unstructured, free-text nature of turbine maintenance logs presents a significant barrier to automated analysis. Our paper addresses this by presenting a novel and reproducible framework for benchmarking Large Language Models (LLMs) on the task of classifying these complex industrial records. To promote transparency and encourage further research, this framework has been made publicly available as an open-source tool. We systematically evaluate a diverse suite of state-of-the-art proprietary and open-source LLMs, providing a foundational assessment of their trade-offs in reliability, operational efficiency, and model calibration. Our results quantify a clear performance hierarchy, identifying top models that exhibit high alignment with a benchmark standard and trustworthy, well-calibrated confidence scores. We also demonstrate that classification performance is highly dependent on the task's semantic ambiguity, with all models showing higher consensus on objective component identification than on interpretive maintenance actions. Given that no model achieves perfect accuracy and that calibration varies dramatically, we conclude that the most effective and responsible near-term application is a Human-in-the-Loop system, where LLMs act as a powerful assistant to accelerate and standardise data labelling for human experts, thereby enhancing O&M data quality and downstream reliability analysis.
本文提出了一种可复现的开源框架,用于评估大语言模型在风机维护日志分类中的表现,筛选出最优模型并建议采用人机协同系统以实现最佳准确性与可靠性。
Our paper introduces a reproducible open-source framework for benchmarking Large Language Models on wind turbine maintenance log classification, identifying top-performing models and recommending human-in-the-loop systems for optimal accuracy and reliability.

Authors:Valentin Quesnel, Damien Sileo
Title: Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem
Abstract:
The scarcity of high-quality, logically sound data is a critical bottleneck for advancing the mathematical reasoning of Large Language Models (LLMs). Our work confronts this challenge by turning decades of automated theorem proving research into a scalable data engine. Rather than relying on error-prone LLMs or complex proof-assistant syntax like Lean and Isabelle, our framework leverages E-prover's saturation capabilities on the vast TPTP axiom library to derive a massive, guaranteed-valid corpus of theorems. Our pipeline is principled and simple: saturate axioms, filter for "interesting" theorems, and generate tasks. With no LLMs in the loop, we eliminate factual errors by construction. This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction. Our zero-shot experiments on frontier models reveal a clear weakness: performance collapses on tasks requiring deep, structural reasoning. Our framework provides both the diagnostic tool to measure this gap and a scalable source of symbolic training data to address it. We make the code and data publicly available. https://github.com/sileod/reasoning_core https://hf.co/datasets/reasoning-core/rc1
中文: 本研究通过利用E-prover在TPTP公理库上的饱和能力构建可扩展数据引擎,生成保证有效的定理数据,转化为三个难度可控的推理任务,既揭示了前沿模型在深度推理上的缺陷,又提供了诊断工具和训练数据。
English: This work addresses the scarcity of high-quality mathematical reasoning data for LLMs by creating a scalable data engine using E-prover and the TPTP library to generate guaranteed-valid theorems, which are then transformed into three difficulty-controlled challenges that reveal models' weaknesses in deep reasoning while providing both diagnostic tools and training data.

Authors:Yuntao Du, Yuetian Chen, Hanshen Xiao, Bruno Ribeiro, Ninghui Li
Title: Imitative Membership Inference Attack
Abstract:
A Membership Inference Attack (MIA) assesses how much a target machine learning model reveals about its training data by determining whether specific query instances were part of the training set. State-of-the-art MIAs rely on training hundreds of shadow models that are independent of the target model, leading to significant computational overhead. In this paper, we introduce Imitative Membership Inference Attack (IMIA), which employs a novel imitative training technique to strategically construct a small number of target-informed imitative models that closely replicate the target model's behavior for inference. Extensive experimental results demonstrate that IMIA substantially outperforms existing MIAs in various attack settings while only requiring less than 5% of the computational cost of state-of-the-art approaches.
中文摘要:本文提出的IMIA通过构建少量目标导向的模仿模型,在显著降低95%以上计算成本的同时,实现了比现有成员推理攻击更优越的性能。
English Summary: The paper introduces IMIA, a novel membership inference attack that uses target-informed imitative models to outperform existing methods while reducing computational costs by over 95%.

Authors:Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
Title: D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
Abstract:
Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
中文摘要:本研究针对网络迷因中的黑色幽默检测,提出了一个新的数据集和增强推理框架,通过整合视觉、文本及自我优化的推理特征,采用三流网络在多项任务中显著超越现有方法。
English Summary: This study introduces a novel dataset and a reasoning-augmented framework for detecting dark humor in online memes, which outperforms existing methods by integrating visual, textual, and self-refined reasoning features through a tri-stream network.

Authors:Qing Xu, Wenting Duan, Zhen Chen
Title: Co-Seg: Mutual Prompt-Guided Collaborative Learning for Tissue and Nuclei Segmentation
Abstract:
Histopathology image analysis is critical yet challenged by the demand of segmenting tissue regions and nuclei instances for tumor microenvironment and cellular morphology analysis. Existing studies focused on tissue semantic segmentation or nuclei instance segmentation separately, but ignored the inherent relationship between these two tasks, resulting in insufficient histopathology understanding. To address this issue, we propose a Co-Seg framework for collaborative tissue and nuclei segmentation. Specifically, we introduce a novel co-segmentation paradigm, allowing tissue and nuclei segmentation tasks to mutually enhance each other. To this end, we first devise a region-aware prompt encoder (RP-Encoder) to provide high-quality semantic and instance region prompts as prior constraints. Moreover, we design a mutual prompt mask decoder (MP-Decoder) that leverages cross-guidance to strengthen the contextual consistency of both tasks, collaboratively computing semantic and instance segmentation masks. Extensive experiments on the PUMA dataset demonstrate that the proposed Co-Seg surpasses state-of-the-arts in the semantic, instance and panoptic segmentation of tumor tissues and nuclei instances. The source code is available at https://github.com/xq141839/Co-Seg.
中文: 提出的Co-Seg框架通过区域感知提示编码器和互提示掩码解码器实现组织与细胞核的协同分割,在PUMA数据集上取得了最优性能。
English: The proposed Co-Seg framework introduces collaborative tissue and nuclei segmentation through a region-aware prompt encoder and mutual prompt mask decoder, achieving state-of-the-art performance on the PUMA dataset.

Authors:Jie Yang, Jiajun Chen, Zhangyue Yin, Shuo Chen, Yuxin Wang, Yiran Guo, Yuan Li, Yining Zheng, Xuanjing Huang, Xipeng Qiu
Title: VehicleWorld: A Highly Integrated Multi-Device Environment for Intelligent Vehicle Interaction
Abstract:
Intelligent vehicle cockpits present unique challenges for API Agents, requiring coordination across tightly-coupled subsystems that exceed typical task environments' complexity. Traditional Function Calling (FC) approaches operate statelessly, requiring multiple exploratory calls to build environmental awareness before execution, leading to inefficiency and limited error recovery. We introduce VehicleWorld, the first comprehensive environment for the automotive domain, featuring 30 modules, 250 APIs, and 680 properties with fully executable implementations that provide real-time state information during agent execution. This environment enables precise evaluation of vehicle agent behaviors across diverse, challenging scenarios. Through systematic analysis, we discovered that direct state prediction outperforms function calling for environmental control. Building on this insight, we propose State-based Function Call (SFC), a novel approach that maintains explicit system state awareness and implements direct state transitions to achieve target conditions. Experimental results demonstrate that SFC significantly outperforms traditional FC approaches, achieving superior execution accuracy and reduced latency. We have made all implementation code publicly available on Github https://github.com/OpenMOSS/VehicleWorld.
中文: 本文介绍了首个汽车领域综合环境VehicleWorld及其可执行模块与API,并提出基于状态的函数调用方法,该方法通过保持系统状态感知显著优于传统函数调用,实现了更高精度和效率。
English: This paper introduces VehicleWorld, a comprehensive automotive environment with executable modules and APIs, and proposes State-based Function Call (SFC), which outperforms traditional function calling by maintaining system state awareness for improved accuracy and efficiency.

Authors:Xiaobei Zhao, Xingqi Lyu, Xiang Li
Title: T-araVLN: Translator for Agricultural Robotic Agents on Vision-and-Language Navigation
Abstract:
Agricultural robotic agents have been becoming powerful helpers in a wide range of agricultural tasks, however, still heavily rely on manual operation or fixed railways for movement. To address this limitation, the AgriVLN method and the A2A benchmark pioneeringly extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling agents to navigate to the target positions following the natural language instructions. AgriVLN effectively understands the simple instructions, but often misunderstands the complex ones. To bridge this gap, we propose the method of Translator for Agricultural Robotic Agents on Vision-and-Language Navigation (T-araVLN), in which the Instruction Translator module translates the original instruction to be more refined and precise. When evaluated on the A2A benchmark, our T-araVLN effectively improves Success Rate from 0.47 to 0.63 and reduces Navigation Error from 2.91m to 2.28m, demonstrating the state-of-the-art performance in the agricultural domain. Code: https://github.com/AlexTraveling/T-araVLN.
中文:提出的T-araVLN方法通过指令翻译模块优化自然语言指令,在A2A基准测试中实现了63%的成功率并降低了导航误差,展现了农业机器人导航的最先进性能。
English: The proposed T-araVLN method enhances agricultural robot navigation by refining natural language instructions through an Instruction Translator module, achieving state-of-the-art performance with a 63% success rate and reduced navigation error on the A2A benchmark.

Authors:Jack Wilkie, Hanan Hindy, Christos Tachtatzis, Robert Atkinson
Title: Contrastive Self-Supervised Network Intrusion Detection using Augmented Negative Pairs
Abstract:
Network intrusion detection remains a critical challenge in cybersecurity. While supervised machine learning models achieve state-of-the-art performance, their reliance on large labelled datasets makes them impractical for many real-world applications. Anomaly detection methods, which train exclusively on benign traffic to identify malicious activity, suffer from high false positive rates, limiting their usability. Recently, self-supervised learning techniques have demonstrated improved performance with lower false positive rates by learning discriminative latent representations of benign traffic. In particular, contrastive self-supervised models achieve this by minimizing the distance between similar (positive) views of benign traffic while maximizing it between dissimilar (negative) views. Existing approaches generate positive views through data augmentation and treat other samples as negative. In contrast, this work introduces Contrastive Learning using Augmented Negative pairs (CLAN), a novel paradigm for network intrusion detection where augmented samples are treated as negative views - representing potentially malicious distributions - while other benign samples serve as positive views. This approach enhances both classification accuracy and inference efficiency after pretraining on benign traffic. Experimental evaluation on the Lycos2017 dataset demonstrates that the proposed method surpasses existing self-supervised and anomaly detection techniques in a binary classification task. Furthermore, when fine-tuned on a limited labelled dataset, the proposed approach achieves superior multi-class classification performance compared to existing self-supervised models.
中文: 本文提出CLAN,一种用于网络入侵检测的新型自监督对比学习方法,将增强样本视为负样本视图以提高分类精度和效率,在Lycos2017数据集上超越了现有技术。
English: This paper introduces CLAN, a novel self-supervised contrastive learning method for network intrusion detection that treats augmented samples as negative views to improve classification accuracy and efficiency, outperforming existing techniques on the Lycos2017 dataset.

Authors:Jibai Lin, Bo Ma, Yating Yang, Xi Zhou, Rong Ma, Turghun Osman, Ahtamjan Ahmat, Rui Dong, Lei Wang
Title: TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement
Abstract:
Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired "winning" (balanced preservation-compliance) and "losing" (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE's superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE's versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.
中文摘要:TIDE框架通过目标监督的三元组对齐和偏好学习,在无需测试时微调的情况下解决了主题驱动图像生成中身份保持与指令遵循的平衡难题,在多个基准测试中展现出卓越性能。
English Summary: The TIDE framework enhances subject-driven image generation by balancing subject identity preservation with instruction compliance through target-supervised triplet alignment and preference learning, achieving superior performance across multiple benchmarks without test-time fine-tuning.

Authors:Zhongxiang Xie, Shuangxi Miao, Yuhan Jiang, Zhewei Zhang, Jing Yao, Xuecao Li, Jianxi Huang, Pedram Ghamisi
Title: FSG-Net: Frequency-Spatial Synergistic Gated Network for High-Resolution Remote Sensing Change Detection
Abstract:
Change detection from high-resolution remote sensing images lies as a cornerstone of Earth observation applications, yet its efficacy is often compromised by two critical challenges. First, false alarms are prevalent as models misinterpret radiometric variations from temporal shifts (e.g., illumination, season) as genuine changes. Second, a non-negligible semantic gap between deep abstract features and shallow detail-rich features tends to obstruct their effective fusion, culminating in poorly delineated boundaries. To step further in addressing these issues, we propose the Frequency-Spatial Synergistic Gated Network (FSG-Net), a novel paradigm that aims to systematically disentangle semantic changes from nuisance variations. Specifically, FSG-Net first operates in the frequency domain, where a Discrepancy-Aware Wavelet Interaction Module (DAWIM) adaptively mitigates pseudo-changes by discerningly processing different frequency components. Subsequently, the refined features are enhanced in the spatial domain by a Synergistic Temporal-Spatial Attention Module (STSAM), which amplifies the saliency of genuine change regions. To finally bridge the semantic gap, a Lightweight Gated Fusion Unit (LGFU) leverages high-level semantics to selectively gate and integrate crucial details from shallow layers. Comprehensive experiments on the CDD, GZ-CD, and LEVIR-CD benchmarks validate the superiority of FSG-Net, establishing a new state-of-the-art with F1-scores of 94.16%, 89.51%, and 91.27%, respectively. The code will be made available at https://github.com/zxXie-Air/FSG-Net after a possible publication.
中文摘要:FSG-Net通过频率-空间协同处理机制解决遥感变化检测中的误报和特征融合难题,在多个基准测试中实现了最优性能。
English Summary: The proposed FSG-Net addresses false alarms and feature fusion challenges in remote sensing change detection through frequency-spatial synergistic processing, achieving state-of-the-art performance on multiple benchmarks.

Authors:Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian
Title: Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning
Abstract:
Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the lost in the middle issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at https://github.com/Aireduce952/Tree-of-Agents.
中文摘要:Tree of Agents (TOA) 框架通过多智能体协作和树状推理路径,有效解决了大语言模型处理长文本时的位置偏见和幻觉问题,在保持高效的同时使用轻量模型实现了卓越性能。
English Summary: The Tree of Agents (TOA) framework addresses long-context challenges in LLMs by employing multi-agent collaboration with tree-structured reasoning paths, achieving superior performance with compact models while maintaining efficiency through caching and pruning strategies.

Authors:Xudong Mou, Rui Wang, Tiejun Wang, Renyu Yang, Shiru Chen, Jie Sun, Tianyu Wo, Xudong Liu
Title: CAPMix: Robust Time Series Anomaly Detection Based on Abnormal Assumptions with Dual-Space Mixup
Abstract:
Time series anomaly detection (TSAD) is a vital yet challenging task, particularly in scenarios where labeled anomalies are scarce and temporal dependencies are complex. Recent anomaly assumption (AA) approaches alleviate the lack of anomalies by injecting synthetic samples and training discriminative models. Despite promising results, these methods often suffer from two fundamental limitations: patchy generation, where scattered anomaly knowledge leads to overly simplistic or incoherent anomaly injection, and Anomaly Shift, where synthetic anomalies either resemble normal data too closely or diverge unrealistically from real anomalies, thereby distorting classification boundaries. In this paper, we propose CAPMix, a controllable anomaly augmentation framework that addresses both issues. First, we design a CutAddPaste mechanism to inject diverse and complex anomalies in a targeted manner, avoiding patchy generation. Second, we introduce a label revision strategy to adaptively refine anomaly labels, reducing the risk of anomaly shift. Finally, we employ dual-space mixup within a temporal convolutional network to enforce smoother and more robust decision boundaries. Extensive experiments on five benchmark datasets, including AIOps, UCR, SWaT, WADI, and ESA, demonstrate that CAPMix achieves significant improvements over state-of-the-art baselines, with enhanced robustness against contaminated training data. The code is available at https://github.com/alsike22/CAPMix.
中文:提出的CAPMix框架通过定向异常注入机制和自适应标签优化,解决了现有方法中异常生成零散和异常偏移的问题,在多个基准测试中实现了卓越的检测性能。
English: The proposed CAPMix framework enhances time series anomaly detection by introducing a targeted anomaly injection mechanism and adaptive label refinement to overcome limitations of patchy generation and anomaly shift, achieving superior performance across multiple benchmarks.

Authors:Yixiao Li, Xin Li, Chris Wei Zhou, Shuo Xing, Hadi Amirpour, Xiaoshuai Hao, Guanghui Yue, Baoquan Zhao, Weide Liu, Xiaoyuan Yang, Zhengzhong Tu, Xinyu Li, Chuanbiao Song, Chenqi Zhang, Jun Lan, Huijia Zhu, Weiqiang Wang, Xiaoyan Sun, Shishun Tian, Dongyang Yan, Weixia Zhang, Junlin Chen, Wei Sun, Zhihua Wang, Zhuohang Shi, Zhizun Luo, Hang Ouyang, Tianxin Xiao, Fan Yang, Zhaowang Wu, Kaixin Deng
Title: VQualA 2025 Challenge on Image Super-Resolution Generated Content Quality Assessment: Methods and Results
Abstract:
This paper presents the ISRGC-Q Challenge, built upon the Image Super-Resolution Generated Content Quality Assessment (ISRGen-QA) dataset, and organized as part of the Visual Quality Assessment (VQualA) Competition at the ICCV 2025 Workshops. Unlike existing Super-Resolution Image Quality Assessment (SR-IQA) datasets, ISRGen-QA places a greater emphasis on SR images generated by the latest generative approaches, including Generative Adversarial Networks (GANs) and diffusion models. The primary goal of this challenge is to analyze the unique artifacts introduced by modern super-resolution techniques and to evaluate their perceptual quality effectively. A total of 108 participants registered for the challenge, with 4 teams submitting valid solutions and fact sheets for the final testing phase. These submissions demonstrated state-of-the-art (SOTA) performance on the ISRGen-QA dataset. The project is publicly available at: https://github.com/Lighting-YXLI/ISRGen-QA.
Chinese: ISRGC-Q挑战赛基于ISRGen-QA数据集,是ICCV 2025研讨会的一部分,重点评估由GAN和扩散模型等先进生成方法产生的超分辨率图像的感知质量及伪影,最终提交方案达到了最先进的性能水平。
English: The ISRGC-Q Challenge, based on the ISRGen-QA dataset and part of the ICCV 2025 Workshops, focuses on assessing perceptual quality and artifacts in super-resolution images produced by advanced generative models like GANs and diffusion models, with top submissions achieving state-of-the-art results.

Authors:Hiroya Makino, Seigo Ito
Title: MAPF-HD: Multi-Agent Path Finding in High-Density Environments
Abstract:
Multi-agent path finding (MAPF) involves planning efficient paths for multiple agents to move simultaneously while avoiding collisions. In typical warehouse environments, agents are often sparsely distributed along aisles. However, increasing the agent density can improve space efficiency. When the agent density is high, we must optimize the paths not only for goal-assigned agents but also for those obstructing them. This study proposes a novel MAPF framework for high-density environments (MAPF-HD). Several studies have explored MAPF in similar settings using integer linear programming (ILP). However, ILP-based methods require substantial computation time to optimize all agent paths simultaneously. Even in small grid-based environments with fewer than $100$ cells, these computations can incur tens to hundreds of seconds. These high computational costs render these methods impractical for large-scale applications such as automated warehouses and valet parking. To address these limitations, we introduce the phased null-agent swapping (PHANS) method. PHANS employs a heuristic approach to incrementally swap positions between agents and empty vertices. This method solves the MAPF-HD problem within seconds to tens of seconds, even in large environments containing more than $700$ cells. The proposed method can potentially improve efficiency in various real-world applications such as warehouse logistics, traffic management, or crowd control. Code is available at https://github.com/ToyotaCRDL/MAPF-in-High-Density-Envs.
Chinese: 本研究针对高密度环境提出了一种新的多智能体路径规划框架(MAPF-HD),采用分阶段空智能体交换(PHANS)方法,通过启发式交换智能体与空顶点位置,可在数秒至数十秒内解决大型环境中的路径规划问题,为仓库物流和交通管理等实际应用提供了可行解决方案。
English: This study introduces a novel multi-agent path finding framework for high-density environments (MAPF-HD) called phased null-agent swapping (PHANS), which efficiently solves path planning problems within seconds to tens of seconds in large environments by heuristically swapping agents with empty vertices, offering practical applications in warehouse logistics and traffic management.

Authors:Jianpeng Zhao, Chenyu Yuan, Weiming Luo, Haoling Xie, Guangwei Zhang, Steven Jige Quan, Zixuan Yuan, Pengyang Wang, Denghui Zhang
Title: Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation
Abstract:
Questionnaire-based surveys are foundational to social science research and public policymaking, yet traditional survey methods remain costly, time-consuming, and often limited in scale. This paper explores a new paradigm: simulating virtual survey respondents using Large Language Models (LLMs). We introduce two novel simulation settings, namely Partial Attribute Simulation (PAS) and Full Attribute Simulation (FAS), to systematically evaluate the ability of LLMs to generate accurate and demographically coherent responses. In PAS, the model predicts missing attributes based on partial respondent profiles, whereas FAS involves generating complete synthetic datasets under both zero-context and context-enhanced conditions. We curate a comprehensive benchmark suite, LLM-S^3 (Large Language Model-based Sociodemographic Survey Simulation), that spans 11 real-world public datasets across four sociological domains. Our evaluation of multiple mainstream LLMs (GPT-3.5/4 Turbo, LLaMA 3.0/3.1-8B) reveals consistent trends in prediction performance, highlights failure modes, and demonstrates how context and prompt design impact simulation fidelity. This work establishes a rigorous foundation for LLM-driven survey simulations, offering scalable and cost-effective tools for sociological research and policy evaluation. Our code and dataset are available at: https://github.com/dart-lab-research/LLM-S-Cube-Benchmark
本文提出了一种利用大型语言模型通过部分和完整属性模拟方法生成虚拟调查对象的新范式,为可扩展且经济高效的社会学研究建立了基准。
This paper introduces a novel approach using Large Language Models to simulate virtual survey respondents through Partial and Full Attribute Simulation methods, establishing a benchmark for scalable and cost-effective sociological research.

Authors:Jeongmin Yu, Susang Kim, Kisu Lee, Taekyoung Kwon, Won-Yong Shin, Ha Young Kim
Title: Multi-View Slot Attention Using Paraphrased Texts for Face Anti-Spoofing
Abstract:
Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP's patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets. Code: https://github.com/Elune001/MVP-FAS.
中文: MVP-FAS框架通过多视角槽位注意力和多文本补丁对齐模块,充分利用CLIP的补丁嵌入和多样化文本提示,显著提升了人脸防伪的跨领域泛化性能。
English: The proposed MVP-FAS framework enhances face anti-spoofing by leveraging multi-view slot attention and multi-text patch alignment to better utilize CLIP's patch embeddings and multiple text prompts, achieving superior cross-domain generalization.

Authors:Ruiming Du, Guangxun Zhai, Tian Qiu, Yu Jiang
Title: Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap
Abstract:
The precise characterization of plant morphology provides valuable insights into plant environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings highlight the efficacy of sparse convolutional backbones and transformer-based instance segmentation, while also emphasizing the complementary role of modeling-based and augmentation-based synthetic data generation for sim-to-real learning in reducing annotation demands. In general, this study bridges the gap between algorithmic advances and practical deployment, providing immediate tools for researchers and a roadmap for developing data-efficient and generalizable deep learning solutions in 3D plant phenotyping. Data and code are available at https://github.com/perrydoremi/PlantSegStudio.
中文: 本综述通过评估深度学习方法、引入基准测试框架,并证明稀疏卷积网络和合成数据的有效性,解决了3D分割在植物表型分析中的局限性,从而弥合了算法进展与实际应用之间的差距。
English: This review tackles the limitations of 3D segmentation in plant phenotyping by evaluating deep learning methods, introducing a benchmarking framework, and demonstrating the effectiveness of sparse convolutional networks and synthetic data to bridge algorithmic advances with practical applications.

Authors:Hang Fan, Yu Shi, Zongliang Fu, Shuo Chen, Wei Wei, Wei Xu, Jian Li
Title: WindFM: An Open-Source Foundation Model for Zero-Shot Wind Power Forecasting
Abstract:
High-quality wind power forecasting is crucial for the operation of modern power grids. However, prevailing data-driven paradigms either train a site-specific model which cannot generalize to other locations or rely on fine-tuning of general-purpose time series foundation models which are difficult to incorporate domain-specific data in the energy sector. This paper introduces WindFM, a lightweight and generative Foundation Model designed specifically for probabilistic wind power forecasting. WindFM employs a discretize-and-generate framework. A specialized time-series tokenizer first converts continuous multivariate observations into discrete, hierarchical tokens. Subsequently, a decoder-only Transformer learns a universal representation of wind generation dynamics by autoregressively pre-training on these token sequences. Using the comprehensive WIND Toolkit dataset comprising approximately 150 billion time steps from more than 126,000 sites, WindFM develops a foundational understanding of the complex interplay between atmospheric conditions and power output. Extensive experiments demonstrate that our compact 8.1M parameter model achieves state-of-the-art zero-shot performance on both deterministic and probabilistic tasks, outperforming specialized models and larger foundation models without any fine-tuning. In particular, WindFM exhibits strong adaptiveness under out-of-distribution data from a different continent, demonstrating the robustness and transferability of its learned representations. Our pre-trained model is publicly available at https://github.com/shiyu-coder/WindFM.
中文: WindFM是一种轻量级生成式基础模型,通过对离散化时序数据进行基于Transformer的自回归预训练,学习通用的风能动态表征,在概率性风电功率预测中实现了最先进的零样本性能。
English: WindFM is a lightweight generative foundation model that achieves state-of-the-art zero-shot performance in probabilistic wind power forecasting by learning universal wind dynamics representations through transformer-based autoregressive pre-training on discretized time-series data.

Authors:Jiangnan Xie, Xiaolong Zheng, Liang Zheng
Title: Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
Abstract:
Visual Grounding (VG) aims to utilize given natural language queries to locate specific target objects within images. While current transformer-based approaches demonstrate strong localization performance in standard scene (i.e, scenarios without any novel objects), they exhibit notable limitations in open-vocabulary scene (i.e, both familiar and novel object categories during testing). These limitations primarily stem from three key factors: (1) imperfect alignment between visual and linguistic modalities, (2) insufficient cross-modal feature fusion, and (3) ineffective utilization of semantic prototype information. To overcome these challenges, we present Prototype-Aware Multimodal Learning (PAML), an innovative framework that systematically addresses these issues through several key components: First, we leverage ALBEF to establish robust cross-modal alignment during initial feature encoding. Subsequently, our Visual Discriminative Feature Encoder selectively enhances salient object representations while suppressing irrelevant visual context. The framework then incorporates a novel prototype discovering and inheriting mechanism that extracts and aggregates multi-neighbor semantic prototypes to facilitate open-vocabulary recognition. These enriched features undergo comprehensive multimodal integration through our Multi-stage Decoder before final bounding box regression. Extensive experiments across five benchmark datasets validate our approach, showing competitive performance in standard scene while achieving state-of-the-art results in open-vocabulary scene. Our code is available at https://github.com/plankXie/PAML.
中文:提出的原型感知多模态学习(PAML)框架通过增强跨模态对齐、特征融合和语义原型利用,克服了开放词汇视觉定位中的局限性,实现了最先进的性能。
English: The proposed Prototype-Aware Multimodal Learning (PAML) framework overcomes limitations in open-vocabulary visual grounding by enhancing cross-modal alignment, feature fusion, and semantic prototype utilization, achieving state-of-the-art performance.

Authors:Xiangcheng Hu, Xieyuanli Chen, Mingkai Jia, Jin Wu, Ping Tan, Steven L. Waslander
Title: DCReg: Decoupled Characterization for Efficient Degenerate LiDAR Registration
Abstract:
LiDAR point cloud registration is fundamental to robotic perception and navigation. However, in geometrically degenerate or narrow environments, registration problems become ill-conditioned, leading to unstable solutions and degraded accuracy. While existing approaches attempt to handle these issues, they fail to address the core challenge: accurately detection, interpret, and resolve this ill-conditioning, leading to missed detections or corrupted solutions. In this study, we introduce DCReg, a principled framework that systematically addresses the ill-conditioned registration problems through three integrated innovations. First, DCReg achieves reliable ill-conditioning detection by employing a Schur complement decomposition to the hessian matrix. This technique decouples the registration problem into clean rotational and translational subspaces, eliminating coupling effects that mask degeneracy patterns in conventional analyses. Second, within these cleanly subspaces, we develop quantitative characterization techniques that establish explicit mappings between mathematical eigenspaces and physical motion directions, providing actionable insights about which specific motions lack constraints. Finally, leveraging this clean subspace, we design a targeted mitigation strategy: a novel preconditioner that selectively stabilizes only the identified ill-conditioned directions while preserving all well-constrained information in observable space. This enables efficient and robust optimization via the Preconditioned Conjugate Gradient method with a single physical interpretable parameter. Extensive experiments demonstrate DCReg achieves at least 20% - 50% improvement in localization accuracy and 5-100 times speedup over state-of-the-art methods across diverse environments. Our implementation will be available at https://github.com/JokerJohn/DCReg.
中文: DCReg框架通过舒尔补分解系统性地检测、表征并缓解退化环境中的LiDAR点云配准病态问题,在精度和速度上实现显著提升。
English: DCReg is a novel framework that systematically detects, characterizes, and mitigates ill-conditioned LiDAR registration in degenerate environments through Schur complement decomposition, achieving significant improvements in accuracy and speed.

Authors:Nitin Gupta, Bapi Dutta, Anupam Yadav
Title: An Explainable Framework for Particle Swarm Optimization using Landscape Analysis and Machine Learning
Abstract:
Swarm intelligence algorithms have demonstrated remarkable success in solving complex optimization problems across diverse domains. However, their widespread adoption is often hindered by limited transparency in how algorithmic components influence performance. This work presents a multi-faceted investigation of Particle Swarm Optimization (PSO) to further understand the key role of different topologies for better interpretability and explainability. To achieve this objective, we first develop a comprehensive landscape characterization framework using Exploratory Landscape Analysis (ELA) to quantify problem difficulty and identify critical features affecting the optimization performance of PSO. Next, we conduct a rigorous empirical study comparing three fundamental swarm communication architectures -- Ring, Star, and Von Neumann topologies -- analysing their distinct impacts on exploration-exploitation balance, convergence behaviour, and solution quality and eventually develop an explainable benchmarking framework for PSO, to decode how swarm topologies affects information flow, diversity, and convergence. Based on this, a novel machine learning approach for automated algorithm configuration is introduced for training predictive models on extensive Area over the Convergence Curve (AOCC) data to recommend optimal settings based on problem characteristics. Through systematic experimentation across twenty four benchmark functions in multiple dimensions, we establish practical guidelines for topology selection and parameter configuration. These findings advance the development of more transparent and reliable swarm intelligence systems. The source codes of this work can be accessed at https://github.com/GitNitin02/ioh_pso.
中文摘要:本研究通过分析不同群体拓扑结构对粒子群优化性能的影响,开发了可解释的基准测试框架和基于问题特征的自动算法配置机器学习方法,从而提升了算法的可解释性。
English Summary: This study enhances the interpretability of Particle Swarm Optimization by analyzing how different swarm topologies affect performance, developing an explainable benchmarking framework and a machine learning approach for automated algorithm configuration based on problem characteristics.

Authors:Honggang Jia, Xiucheng Wang, Nan Cheng, Ruijin Sun, Changle Li
Title: UrbanMIMOMap: A Ray-Traced MIMO CSI Dataset with Precoding-Aware Maps and Benchmarks
Abstract:
Sixth generation (6G) systems require environment-aware communication, driven by native artificial intelligence (AI) and integrated sensing and communication (ISAC). Radio maps (RMs), providing spatially continuous channel information, are key enablers. However, generating high-fidelity RM ground truth via electromagnetic (EM) simulations is computationally intensive, motivating machine learning (ML)-based RM construction. The effectiveness of these data-driven methods depends on large-scale, high-quality training data. Current public datasets often focus on single-input single-output (SISO) and limited information, such as path loss, which is insufficient for advanced multi-input multi-output (MIMO) systems requiring detailed channel state information (CSI). To address this gap, this paper presents UrbanMIMOMap, a novel large-scale urban MIMO CSI dataset generated using high-precision ray tracing. UrbanMIMOMap offers comprehensive complex CSI matrices across a dense spatial grid, going beyond traditional path loss data. This rich CSI is vital for constructing high-fidelity RMs and serves as a fundamental resource for data-driven RM generation, including deep learning. We demonstrate the dataset's utility through baseline performance evaluations of representative ML methods for RM construction. This work provides a crucial dataset and reference for research in high-precision RM generation, MIMO spatial performance, and ML for 6G environment awareness. The code and data for this work are available at: https://github.com/UNIC-Lab/UrbanMIMOMap.
中文摘要:本文提出UrbanMIMOMap这一基于射线追踪生成的大规模城市MIMO信道状态信息数据集,旨在弥补现有数据集的不足,为构建6G环境感知通信所需的高精度无线电地图提供关键数据支持。
English Summary: This paper introduces UrbanMIMOMap, a large-scale urban MIMO channel state information dataset generated via ray tracing to address the limitations of existing datasets and support high-fidelity radio map construction for 6G environment-aware communication systems.

Authors:Qin Yang, Nicholas Stout, Meisam Mohammady, Han Wang, Ayesha Samreen, Christopher J Quinn, Yan Yan, Ashish Kundu, Yuan Hong
Title: PLRV-O: Advancing Differentially Private Deep Learning via Privacy Loss Random Variable Optimization
Abstract:
Differentially Private Stochastic Gradient Descent (DP-SGD) is a standard method for enforcing privacy in deep learning, typically using the Gaussian mechanism to perturb gradient updates. However, conventional mechanisms such as Gaussian and Laplacian noise are parameterized only by variance or scale. This single degree of freedom ties the magnitude of noise directly to both privacy loss and utility degradation, preventing independent control of these two factors. The problem becomes more pronounced when the number of composition rounds T and batch size B vary across tasks, as these variations induce task-dependent shifts in the privacy-utility trade-off, where small changes in noise parameters can disproportionately affect model accuracy. To address this limitation, we introduce PLRV-O, a framework that defines a broad search space of parameterized DP-SGD noise distributions, where privacy loss moments are tightly characterized yet can be optimized more independently with respect to utility loss. This formulation enables systematic adaptation of noise to task-specific requirements, including (i) model size, (ii) training duration, (iii) batch sampling strategies, and (iv) clipping thresholds under both training and fine-tuning settings. Empirical results demonstrate that PLRV-O substantially improves utility under strict privacy constraints. On CIFAR-10, a fine-tuned ViT achieves 94.03% accuracy at epsilon approximately 0.5, compared to 83.93% with Gaussian noise. On SST-2, RoBERTa-large reaches 92.20% accuracy at epsilon approximately 0.2, versus 50.25% with Gaussian.
中文: 本文提出了PLRV-O框架,为差分隐私随机梯度下降定义了一个参数化噪声分布的广泛搜索空间,能够更独立地控制隐私损失和效用损失,从而根据任务特定需求系统调整噪声,在严格隐私约束下显著提升模型准确性。
English: This paper introduces PLRV-O, a framework that creates a search space for parameterized noise distributions in DP-SGD, enabling more independent control over privacy loss and utility degradation to systematically adapt noise to task-specific requirements and significantly improve model accuracy under strict privacy constraints.

Authors:Lucas Wojcik, Luiz Coelho, Roger Granada, David Menotti
Title: Exploring Light-Weight Object Recognition for Real-Time Document Detection
Abstract:
Object Recognition and Document Skew Estimation have come a long way in terms of performance and efficiency. New models follow one of two directions: improving performance using larger models, and improving efficiency using smaller models. However, real-time document detection and rectification is a niche that is largely unexplored by the literature, yet it remains a vital step for automatic information retrieval from visual documents. In this work, we strive towards an efficient document detection pipeline that is satisfactory in terms of Optical Character Recognition (OCR) retrieval and faster than other available solutions. We adapt IWPOD-Net, a license plate detection network, and train it for detection on NBID, a synthetic ID card dataset. We experiment with data augmentation and cross-dataset validation with MIDV (another synthetic ID and passport document dataset) to find the optimal scenario for the model. Other methods from both the Object Recognition and Skew Estimation state-of-the-art are evaluated for comparison with our approach. We use each method to detect and rectify the document, which is then read by an OCR system. The OCR output is then evaluated using a novel OCR quality metric based on the Levenshtein distance. Since the end goal is to improve automatic information retrieval, we use the overall OCR quality as a performance metric. We observe that with a promising model, document rectification does not have to be perfect to attain state-of-the-art performance scores. We show that our model is smaller and more efficient than current state-of-the-art solutions while retaining a competitive OCR quality metric. All code is available at https://github.com/BOVIFOCR/iwpod-doc-corners.git
中文: 本研究基于车牌检测网络开发了一种高效的文档检测与校正流程,在保持竞争力的OCR质量指标的同时,实现了比现有最优方案更轻量、更快速的性能表现。
English: This research introduces an efficient document detection and rectification pipeline adapted from a license plate detection network, achieving competitive OCR performance with a smaller, faster model than current state-of-the-art methods.

Authors:Olivier Schipper, Yudi Zhang, Yali Du, Mykola Pechenizkiy, Meng Fang
Title: PillagerBench: Benchmarking LLM-Based Agents in Competitive Minecraft Team Environments
Abstract:
LLM-based agents have shown promise in various cooperative and strategic reasoning tasks, but their effectiveness in competitive multi-agent environments remains underexplored. To address this gap, we introduce PillagerBench, a novel framework for evaluating multi-agent systems in real-time competitive team-vs-team scenarios in Minecraft. It provides an extensible API, multi-round testing, and rule-based built-in opponents for fair, reproducible comparisons. We also propose TactiCrafter, an LLM-based multi-agent system that facilitates teamwork through human-readable tactics, learns causal dependencies, and adapts to opponent strategies. Our evaluation demonstrates that TactiCrafter outperforms baseline approaches and showcases adaptive learning through self-play. Additionally, we analyze its learning process and strategic evolution over multiple game episodes. To encourage further research, we have open-sourced PillagerBench, fostering advancements in multi-agent AI for competitive environments.
中文: PillagerBench是一个用于在竞争性《我的世界》场景中评估多智能体系统的新框架,而TactiCrafter则是一个基于大语言模型的系统,通过增强团队协作和适应对手策略,在自适应学习中超越了基线方法。
English: PillagerBench is a novel framework for evaluating multi-agent systems in competitive Minecraft scenarios, while TactiCrafter is an LLM-based system that enhances teamwork and adapts to opponents, outperforming baselines through adaptive learning.

Authors:Amna Hassan, Ilsa Afzaal, Nouman Muneeb, Aneeqa Batool, Hamail Noor
Title: AI-Based Applied Innovation for Fracture Detection in X-rays Using Custom CNN and Transfer Learning Models
Abstract:
Bone fractures present a major global health challenge, often resulting in pain, reduced mobility, and productivity loss, particularly in low-resource settings where access to expert radiology services is limited. Conventional imaging methods suffer from high costs, radiation exposure, and dependency on specialized interpretation. To address this, we developed an AI-based solution for automated fracture detection from X-ray images using a custom Convolutional Neural Network (CNN) and benchmarked it against transfer learning models including EfficientNetB0, MobileNetV2, and ResNet50. Training was conducted on the publicly available FracAtlas dataset, comprising 4,083 anonymized musculoskeletal radiographs. The custom CNN achieved 95.96% accuracy, 0.94 precision, 0.88 recall, and an F1-score of 0.91 on the FracAtlas dataset. Although transfer learning models (EfficientNetB0, MobileNetV2, ResNet50) performed poorly in this specific setup, these results should be interpreted in light of class imbalance and data set limitations. This work highlights the promise of lightweight CNNs for detecting fractures in X-rays and underscores the importance of fair benchmarking, diverse datasets, and external validation for clinical translation
中文: 本研究开发了一种定制CNN模型,在X光骨折自动检测中达到95.96%准确率,尽管迁移学习模型表现欠佳,但证明了人工智能在解决资源匮乏地区诊断难题方面的潜力。
English: This study developed a custom CNN model achieving 95.96% accuracy for automated fracture detection in X-rays, demonstrating AI's potential to address diagnostic challenges in resource-limited settings despite limitations in transfer learning models.

Authors:Vedran Novaković
Title: Recursive vectorized computation of the Frobenius norm
Abstract:
Recursive algorithms for computing the Frobenius norm of a real array are proposed, based on hypot, a hypotenuse function. Comparing their relative accuracy bounds with those of the BLAS routine DNRM2 it is shown that the proposed algorithms could in many cases be significantly more accurate. The scalar recursive algorithms are vectorized with the Intel's vector instructions to achieve performance comparable to xNRM2, and are further parallelized with OpenCilk. Some scalar algorithms are unconditionally bitwise reproducible, while the reproducibility of the vector ones depends on the vector width.
中文: 提出的基于 hypot 函数的递归算法计算 Frobenius 范数,在许多情况下比 BLAS DNRM2 程序精度更高,并通过英特尔向量指令和 OpenCilk 并行化实现高性能,同时部分算法具备无条件比特级可重现性。
English: The proposed recursive algorithms for computing the Frobenius norm using the hypot function demonstrate significantly higher accuracy than the BLAS DNRM2 routine in many cases, and are optimized for performance through vectorization with Intel instructions and parallelization with OpenCilk, while ensuring bitwise reproducibility under certain conditions.

Authors:Hao Liang, Ruitao Wu, Bohan Zeng, Junbo Niu, Wentao Zhang, Bin Dong
Title: Multimodal Reasoning for Science: Technical Report and 1st Place Solution to the ICML 2025 SeePhys Challenge
Abstract:
Multimodal reasoning remains a fundamental challenge in artificial intelligence. Despite substantial advances in text-based reasoning, even state-of-the-art models such as GPT-o3 struggle to maintain strong performance in multimodal scenarios. To address this gap, we introduce a caption-assisted reasoning framework that effectively bridges visual and textual modalities. Our approach achieved 1st place in the ICML 2025 AI for Math Workshop \& Challenge 2: SeePhys, highlighting its effectiveness and robustness. Furthermore, we validate its generalization on the MathVerse benchmark for geometric reasoning, demonstrating the versatility of our method. Our code is publicly available at https://github.com/OpenDCAI/SciReasoner.
中文摘要:作者提出了一种字幕辅助推理框架来增强多模态推理能力,该方法在ICML 2025 SeePhys挑战赛中荣获第一,并在MathVerse基准测试中展现出优秀的泛化性能。
English Summary: The authors propose a caption-assisted reasoning framework to enhance multimodal reasoning, achieving top performance in the ICML 2025 SeePhys challenge and demonstrating strong generalization on the MathVerse benchmark.

Authors:Zhenqi Jia, Rui Liu, Berrak Sisman, Haizhou Li
Title: Multimodal Fine-grained Context Interaction Graph Modeling for Conversational Speech Synthesis
Abstract:
Conversational Speech Synthesis (CSS) aims to generate speech with natural prosody by understanding the multimodal dialogue history (MDH). The latest work predicts the accurate prosody expression of the target utterance by modeling the utterance-level interaction characteristics of MDH and the target utterance. However, MDH contains fine-grained semantic and prosody knowledge at the word level. Existing methods overlook the fine-grained semantic and prosodic interaction modeling. To address this gap, we propose MFCIG-CSS, a novel Multimodal Fine-grained Context Interaction Graph-based CSS system. Our approach constructs two specialized multimodal fine-grained dialogue interaction graphs: a semantic interaction graph and a prosody interaction graph. These two interaction graphs effectively encode interactions between word-level semantics, prosody, and their influence on subsequent utterances in MDH. The encoded interaction features are then leveraged to enhance synthesized speech with natural conversational prosody. Experiments on the DailyTalk dataset demonstrate that MFCIG-CSS outperforms all baseline models in terms of prosodic expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/MFCIG-CSS.
中文:MFCIG-CSS通过构建多模态细粒度交互图来建模对话历史中的词级语义和韵律影响,在韵律表现力上显著优于现有基线模型。
English: MFCIG-CSS introduces multimodal fine-grained interaction graphs to model word-level semantic and prosodic influences in dialogue history, significantly enhancing conversational speech prosody over existing methods.

Authors:Fei Wang, Yujie Li, Zezhi Shao, Chengqing Yu, Yisong Fu, Zhulin An, Yongjun Xu, Xueqi Cheng
Title: ARIES: Relation Assessment and Model Recommendation for Deep Time Series Forecasting
Abstract:
Recent advancements in deep learning models for time series forecasting have been significant. These models often leverage fundamental time series properties such as seasonality and non-stationarity, which may suggest an intrinsic link between model performance and data properties. However, existing benchmark datasets fail to offer diverse and well-defined temporal patterns, restricting the systematic evaluation of such connections. Additionally, there is no effective model recommendation approach, leading to high time and cost expenditures when testing different architectures across different downstream applications. For those reasons, we propose ARIES, a framework for assessing relation between time series properties and modeling strategies, and for recommending deep forcasting models for realistic time series. First, we construct a synthetic dataset with multiple distinct patterns, and design a comprehensive system to compute the properties of time series. Next, we conduct an extensive benchmarking of over 50 forecasting models, and establish the relationship between time series properties and modeling strategies. Our experimental results reveal a clear correlation. Based on these findings, we propose the first deep forecasting model recommender, capable of providing interpretable suggestions for real-world time series. In summary, ARIES is the first study to establish the relations between the properties of time series data and modeling strategies, while also implementing a model recommendation system. The code is available at: https://github.com/blisky-li/ARIES.
Chinese: ARIES框架通过全面基准测试确立了时间序列特性与建模策略之间的明确关联,并推出了首个可解释的深度预测模型推荐系统,以解决现有数据集和评估方法的局限性。
English: The ARIES framework establishes a clear correlation between time series properties and modeling strategies through comprehensive benchmarking and introduces the first interpretable deep forecasting model recommender to address the limitations of existing datasets and evaluation methods.

Authors:Xinyu Gao, Xiangtao Meng, Yingkai Dong, Zheng Li, Shanqing Guo
Title: DCMI: A Differential Calibration Membership Inference Attack Against Retrieval-Augmented Generation
Abstract:
While Retrieval-Augmented Generation (RAG) effectively reduces hallucinations by integrating external knowledge bases, it introduces vulnerabilities to membership inference attacks (MIAs), particularly in systems handling sensitive data. Existing MIAs targeting RAG's external databases often rely on model responses but ignore the interference of non-member-retrieved documents on RAG outputs, limiting their effectiveness. To address this, we propose DCMI, a differential calibration MIA that mitigates the negative impact of non-member-retrieved documents. Specifically, DCMI leverages the sensitivity gap between member and non-member retrieved documents under query perturbation. It generates perturbed queries for calibration to isolate the contribution of member-retrieved documents while minimizing the interference from non-member-retrieved documents. Experiments under progressively relaxed assumptions show that DCMI consistently outperforms baselines--for example, achieving 97.42% AUC and 94.35% Accuracy against the RAG system with Flan-T5, exceeding the MBA baseline by over 40%. Furthermore, on real-world RAG platforms such as Dify and MaxKB, DCMI maintains a 10%-20% advantage over the baseline. These results highlight significant privacy risks in RAG systems and emphasize the need for stronger protection mechanisms. We appeal to the community's consideration of deeper investigations, like ours, against the data leakage risks in rapidly evolving RAG systems. Our code is available at https://github.com/Xinyu140203/RAG_MIA.
中文: 检索增强生成(RAG)系统因非成员文档的干扰易受成员推理攻击,为此提出的差分校准方法DCMI能有效分离成员贡献,在准确率和隐私风险防控上显著优于现有基线方案。
English: Retrieval-Augmented Generation (RAG) systems are vulnerable to membership inference attacks due to interference from non-member documents, prompting the development of DCMI, a differential calibration method that effectively isolates member contributions and significantly outperforms existing baselines in both accuracy and privacy risk mitigation.

Authors:Haoyang He, Zihua Rong, Kun Ji, Chenyang Li, Qing Huang, Chong Xia, Lan Yang, Honggang Zhang
Title: Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL
Abstract:
Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical or programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Furthermore, such task-specific training offers limited control over logical depth and therefore may fail to reveal a model's genuine reasoning capacity. We propose Dynamic Reasoning Efficiency Reward (DRER) -- a plug-and-play RL reward framework that reshapes both reward and advantage signals. (i) A Reasoning Quality Reward assigns fine-grained credit to those reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising the trajectories with beneficial CoT tokens. (ii) A Dynamic Length Advantage decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release Logictree, a dynamically constructed deductive reasoning dataset that functions both as RL training data and as a comprehensive benchmark. Experiments confirm the effectiveness of DRER: our 7B model attains GPT-o3-mini level performance on Logictree with 400 trianing steps, while the average confidence of CoT-augmented answers rises by 30%. The model further exhibits generalisation across diverse logical-reasoning datasets, and the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing formal-reasoning skills in large language models. All code and data are available in repository https://github.com/Henryhe09/DRER.
中文: 提出的动态推理效率奖励(DRER)框架通过激励有益的思维链标记和动态长度调整来增强大型语言模型的推理能力,在逻辑推理和数学基准测试中达到GPT-o3-mini水平性能,并显著提升答案置信度与泛化能力。
English: The proposed Dynamic Reasoning Efficiency Reward (DRER) framework enhances reasoning in large language models by incentivizing beneficial Chain-of-Thought tokens and stabilizing training through dynamic length adjustments, achieving GPT-o3-mini level performance with improved confidence and generalization across logical and mathematical benchmarks.

Authors:Zhiwen Shao, Yifan Cheng, Fan Zhang, Xuehuai Shi, Canlin Li, Lizhuang Ma, Dit-yan Yeung
Title: Micro-Expression Recognition via Fine-Grained Dynamic Perception
Abstract:
Facial micro-expression recognition (MER) is a challenging task, due to the transience, subtlety, and dynamics of micro-expressions (MEs). Most existing methods resort to hand-crafted features or deep networks, in which the former often additionally requires key frames, and the latter suffers from small-scale and low-diversity training data. In this paper, we develop a novel fine-grained dynamic perception (FDP) framework for MER. We propose to rank frame-level features of a sequence of raw frames in chronological order, in which the rank process encodes the dynamic information of both ME appearances and motions. Specifically, a novel local-global feature-aware transformer is proposed for frame representation learning. A rank scorer is further adopted to calculate rank scores of each frame-level feature. Afterwards, the rank features from rank scorer are pooled in temporal dimension to capture dynamic representation. Finally, the dynamic representation is shared by a MER module and a dynamic image construction module, in which the former predicts the ME category, and the latter uses an encoder-decoder structure to construct the dynamic image. The design of dynamic image construction task is beneficial for capturing facial subtle actions associated with MEs and alleviating the data scarcity issue. Extensive experiments show that our method (i) significantly outperforms the state-of-the-art MER methods, and (ii) works well for dynamic image construction. Particularly, our FDP improves by 4.05%, 2.50%, 7.71%, and 2.11% over the previous best results in terms of F1-score on the CASME II, SAMM, CAS(ME)^2, and CAS(ME)^3 datasets, respectively. The code is available at https://github.com/CYF-cuber/FDP.
中文: 本文提出了一种细粒度动态感知框架,通过排序帧级特征并利用变换器捕捉动态表示,显著提升了面部微表情识别的性能,在多个数据集上取得了最优结果。
English: This paper introduces a fine-grained dynamic perception framework that enhances facial micro-expression recognition by ranking frame-level features and using a transformer for dynamic representation, achieving state-of-the-art results across multiple datasets.

Authors:Wanyin Cheng, Zanxi Ruan
Title: BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users
Abstract:
Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions-posing a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at https://github.com/Accecwan/BLaVe-CoT.
中文摘要:BLaVe-CoT是一种新型视觉问答框架,通过生成多样化答案、空间定位及思维链推理来应对盲人和低视力用户的模糊提问,在基准测试中优于现有方法,提升了辅助技术的包容性。
English Summary: BLaVe-CoT is a novel VQA framework that addresses ambiguity in questions from blind and low vision users by generating diverse answers, grounding them spatially, and using chain-of-thought reasoning to evaluate answer consistency, outperforming previous methods on benchmarks.

Authors:Jeonghyun Noh, Wangsu Jeon, Jinsun Park
Title: Dual Interaction Network with Cross-Image Attention for Medical Image Segmentation
Abstract:
Medical image segmentation is a crucial method for assisting professionals in diagnosing various diseases through medical imaging. However, various factors such as noise, blurriness, and low contrast often hinder the accurate diagnosis of diseases. While numerous image enhancement techniques can mitigate these issues, they may also alter crucial information needed for accurate diagnosis in the original image. Conventional image fusion strategies, such as feature concatenation can address this challenge. However, they struggle to fully leverage the advantages of both original and enhanced images while suppressing the side effects of the enhancements. To overcome the problem, we propose a dual interactive fusion module (DIFM) that effectively exploits mutual complementary information from the original and enhanced images. DIFM employs cross-attention bidirectionally to simultaneously attend to corresponding spatial information across different images, subsequently refining the complementary features via global spatial attention. This interaction leverages low- to high-level features implicitly associated with diverse structural attributes like edges, blobs, and object shapes, resulting in enhanced features that embody important spatial characteristics. In addition, we introduce a multi-scale boundary loss based on gradient extraction to improve segmentation accuracy at object boundaries. Experimental results on the ACDC and Synapse datasets demonstrate the superiority of the proposed method quantitatively and qualitatively. Code available at: https://github.com/JJeong-Gari/DIN
中文: 提出的双重交互融合模块(DIFM)通过交叉注意力和全局空间注意力有效整合原始与增强医学图像的互补信息,结合多尺度边界损失提升物体边界分割精度,在基准数据集上展现出优越性能。
English: The proposed dual interactive fusion module (DIFM) effectively integrates complementary information from original and enhanced medical images using cross-attention and global spatial attention, while a multi-scale boundary loss improves segmentation accuracy at object boundaries, demonstrating superior performance on benchmark datasets.

Authors:Feng Wang, Zihao Yu
Title: Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Abstract:
Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
中文摘要:强化学习虽能提升流匹配模型的图像生成质量,但随机采样会引入噪声伪影;我们提出的系数保持采样方法可消除此类噪声,从而改进奖励建模并加速训练收敛。
English Summary: Reinforcement Learning enhances image generation in Flow Matching models but introduces noise artifacts through stochastic sampling, which our proposed Coefficients-Preserving Sampling method eliminates to improve reward modeling and training convergence.

Authors:Feng Wang, Zihao Yu
Title: Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching
Abstract:
Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation in Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step for applying online RL methods on Flow Matching is the introduction of stochasticity into the deterministic framework, commonly realized by Stochastic Differential Equation (SDE). Our investigation reveals a significant drawback to this approach: SDE-based sampling introduces pronounced noise artifacts in the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts. This leads to more accurate reward modeling, ultimately enabling faster and more stable convergence for reinforcement learning-based optimizers like Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
中文摘要:强化学习虽能提升流匹配模型的图像生成质量,但随机采样会引入噪声伪影;我们提出的系数保持采样方法可消除此类噪声,从而改进奖励建模并加速训练收敛。
English Summary: Reinforcement Learning enhances image generation in Flow Matching models but introduces noise artifacts through stochastic sampling, which our proposed Coefficients-Preserving Sampling method eliminates to improve reward modeling and training convergence.

Authors:Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Shaowu Pan, Min-Ling Zhang
Title: Code2MCP: A Multi-Agent Framework for Automated Transformation of Code Repositories into Model Context Protocol Services
Abstract:
The proliferation of Large Language Models (LLMs) has created a significant integration challenge in the AI agent ecosystem, often called the "$N \times M$ problem," where N models require custom integrations for M tools. This fragmentation stifles innovation and creates substantial development overhead. While the Model Context Protocol (MCP) has emerged as a standard to resolve this, its adoption is hindered by the manual effort required to convert the vast universe of existing software into MCP-compliant services. This is especially true for the millions of open-source repositories on GitHub, the world's largest collection of functional code. This paper introduces Code2MCP, a highly automated, agentic framework designed to transform any GitHub repository into a functional MCP service with minimal human intervention. Our system employs a multi-stage workflow that automates the entire process, from code analysis and environment configuration to service generation and deployment. A key innovation of our framework is an LLM-driven, closed-loop "Run--Review--Fix" cycle, which enables the system to autonomously debug and repair the code it generates. Code2MCP produces not only deployable services but also comprehensive technical documentation, acting as a catalyst to accelerate the MCP ecosystem by systematically unlocking the world's largest open-source code repository and automating the critical last mile of tool integration. The code is open-sourced at https://github.com/DEFENSE-SEU/MCP-Github-Agent.
中文: Code2MCP是一个自动化框架,可将GitHub代码库转化为MCP兼容服务,填补了工具池构建的空白,有力推动了模型上下文协议的实际应用。
English: Code2MCP is an automated framework that converts GitHub repositories into MCP-compatible services, addressing the gap in populating tool pools and accelerating the adoption of the Model Context Protocol.

Authors:Chaoqian Ouyang, Ling Yue, Shimin Di, Libin Zheng, Linan Yue, Shaowu Pan, Jian Yin, Min-Ling Zhang
Title: Code2MCP: Transforming Code Repositories into MCP Services
Abstract:
The Model Context Protocol (MCP) aims to create a standard for how Large Language Models use tools. However, most current research focuses on selecting tools from an existing pool. A more fundamental, yet largely overlooked, problem is how to populate this pool by converting the vast number of existing software projects into MCP-compatible services. To bridge this gap, we introduce Code2MCP, an agent-based framework that automatically transforms a GitHub repository into a functional MCP service with minimal human intervention. Code2MCP employs a multi-agent workflow for code analysis, environment setup, tool function design, and service generation, enhanced by a self-correcting loop to ensure reliability. We demonstrate that Code2MCP successfully transforms open-source computing libraries in scientific fields such as bioinformatics, mathematics, and fluid dynamics that are not available in existing MCP servers. By providing a novel automated pathway to unlock GitHub, the world's largest code repository, for the MCP ecosystem, Code2MCP serves as a catalyst to significantly accelerate the protocol's adoption and practical application. The code is public at https://github.com/DEFENSE-SEU/Code2MCP.
中文: Code2MCP是一个自动化框架,可将GitHub代码库转化为MCP兼容服务,填补了工具池构建的空白,有力推动了模型上下文协议的实际应用。
English: Code2MCP is an automated framework that converts GitHub repositories into MCP-compatible services, addressing the gap in populating tool pools and accelerating the adoption of the Model Context Protocol.

Authors:Md Hasebul Hasan, Mahir Labib Dihan, Mohammed Eunus Ali, Md Rizwan Parvez
Title: MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration
Abstract:
Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly-often overwhelming the LLM when handling similar but subtly different geospatial APIs-MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules-such as map-based services-we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks-MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA-and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines. We open-source our framwork at https://github.com/Hasebul/MapAgent.
Chinese Summary: MapAgent是一种分层多智能体框架,通过将规划与执行解耦并采用专业化模块和自适应工具协调,显著提升了地理空间推理能力,在多项基准测试中优于现有先进方法。
English Summary: MapAgent is a hierarchical multi-agent framework designed to enhance geospatial reasoning by decoupling planning from execution, using specialized modules and adaptive tool coordination to outperform existing approaches on diverse benchmarks.

Authors:Shuolong Chen, Xingxing Li, Liu Yuan
Title: eKalibr-Inertial: Continuous-Time Spatiotemporal Calibration for Event-Based Visual-Inertial Systems
Abstract:
The bioinspired event camera, distinguished by its exceptional temporal resolution, high dynamic range, and low power consumption, has been extensively studied in recent years for motion estimation, robotic perception, and object detection. In ego-motion estimation, the visual-inertial setup is commonly adopted due to complementary characteristics between sensors (e.g., scale perception and low drift). For optimal event-based visual-inertial fusion, accurate spatiotemporal (extrinsic and temporal) calibration is required. In this work, we present eKalibr-Inertial, an accurate spatiotemporal calibrator for event-based visual-inertial systems, utilizing the widely used circle grid board. Building upon the grid pattern recognition and tracking methods in eKalibr and eKalibr-Stereo, the proposed method starts with a rigorous and efficient initialization, where all parameters in the estimator would be accurately recovered. Subsequently, a continuous-time-based batch optimization is conducted to refine the initialized parameters toward better states. The results of extensive real-world experiments show that eKalibr-Inertial can achieve accurate event-based visual-inertial spatiotemporal calibration. The implementation of eKalibr-Inertial is open-sourced at (https://github.com/Unsigned-Long/eKalibr) to benefit the research community.
Chinese: 本文提出eKalibr-Inertial,一种基于事件相机的视觉-惯性系统开源时空标定方法,通过严格初始化和连续时间批量优化实现精确参数估计。
English: This paper introduces eKalibr-Inertial, an open-source spatiotemporal calibration method for event-based visual-inertial systems that achieves accurate parameter estimation through rigorous initialization and continuous-time batch optimization.

Authors:Tyler Ward, Abdullah Imran
Title: A Probabilistic Segment Anything Model for Ambiguity-Aware Medical Image Segmentation
Abstract:
Recent advances in promptable segmentation, such as the Segment Anything Model (SAM), have enabled flexible, high-quality mask generation across a wide range of visual domains. However, SAM and similar models remain fundamentally deterministic, producing a single segmentation per object per prompt, and fail to capture the inherent ambiguity present in many real-world tasks. This limitation is particularly troublesome in medical imaging, where multiple plausible segmentations may exist due to annotation uncertainty or inter-expert variability. In this paper, we introduce Probabilistic SAM, a probabilistic extension of SAM that models a distribution over segmentations conditioned on both the input image and prompt. By incorporating a latent variable space and training with a variational objective, our model learns to generate diverse and plausible segmentation masks reflecting the variability in human annotations. The architecture integrates a prior and posterior network into the SAM framework, allowing latent codes to modulate the prompt embeddings during inference. The latent space allows for efficient sampling during inference, enabling uncertainty-aware outputs with minimal overhead. We evaluate Probabilistic SAM on the public LIDC-IDRI lung nodule dataset and demonstrate its ability to produce diverse outputs that align with expert disagreement, outperforming existing probabilistic baselines on uncertainty-aware metrics. Our code is available at: https://github.com/tbwa233/Probabilistic-SAM/.
中文: 针对SAM模型确定性分割的局限,本文提出概率SAM,通过引入潜在变量和变分训练生成多样化分割掩码,有效反映医学图像中专家标注的不确定性,并在公开数据集上验证了其优越性。
English: The Segment Anything Model (SAM) is deterministic and fails to capture segmentation ambiguity, so Probabilistic SAM is introduced as an extension that models a distribution over segmentations to generate diverse, uncertainty-aware masks, particularly benefiting medical imaging.

Authors:Zijian Chen, Wenjie Hua, Jinhao Li, Lirong Deng, Fan Du, Tingzhu Chen, Guangtao Zhai
Title: PictOBI-20k: Unveiling Large Multimodal Models in Visual Decipherment for Pictographic Oracle Bone Characters
Abstract:
Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has remained the ultimate, unwavering goal of scholars, offering an irreplaceable key to understanding humanity's early modes of production. Current decipherment methodologies of OBC are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. With the powerful visual perception capability of large multimodal models (LMMs), the potential of using LMMs for visually deciphering OBCs has increased. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment tasks of pictographic OBCs. It includes 20k meticulously collected OBC and real object images, forming over 15k multi-choice questions. We also conduct subjective annotations to investigate the consistency of the reference point between humans and LMMs in visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills, and LMMs are not effectively using visual information, while most of the time they are limited by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.
中文摘要:本文提出了PictOBI-20k数据集,用于评估大型多模态模型对甲骨文的视觉解读能力,发现现有模型具备初步解读技能但主要受限于语言先验而非视觉信息。
English Summary: This paper introduces PictOBI-20k, a dataset for evaluating large multimodal models' ability to visually decipher oracle bone characters, revealing that current models possess basic skills but rely more on language priors than visual information.

Authors:Jinkun Geng, Shuai Mu, Anirudh Sivaraman, Balaji Prabhakar
Title: Tiga: Accelerating Geo-Distributed Transactions with Synchronized Clocks [Technical Report]
Abstract:
This paper presents Tiga, a new design for geo-replicated and scalable transactional databases such as Google Spanner. Tiga aims to commit transactions within 1 wide-area roundtrip time, or 1 WRTT, for a wide range of scenarios, while maintaining high throughput with minimal computational overhead. Tiga consolidates concurrency control and consensus, completing both strictly serializable execution and consistent replication in a single round. It uses synchronized clocks to proactively order transactions by assigning each a future timestamp at submission. In most cases, transactions arrive at servers before their future timestamps and are serialized according to the designated timestamp, requiring 1 WRTT to commit. In rare cases, transactions are delayed and proactive ordering fails, in which case Tiga falls back to a slow path, committing in 1.5--2 WRTTs. Compared to state-of-the-art solutions, Tiga can commit more transactions at 1-WRTT latency, and incurs much less throughput overhead. Evaluation results show that Tiga outperforms all baselines, achieving 1.3--7.2$\times$ higher throughput and 1.4--4.6$\times$ lower latency. Tiga is open-sourced at https://github.com/New-Consensus-Concurrency-Control/Tiga.
中文: Tiga 是一种地理复制的可扩展事务数据库设计,通过同步时钟主动分配时间戳,在单次广域往返时间(1 WRTT)内提交事务,相比现有方案实现了更高的吞吐量和更低的延迟。
English: Tiga is a geo-replicated transactional database design that commits transactions in one wide-area roundtrip time (1 WRTT) using synchronized clocks for proactive ordering, achieving higher throughput and lower latency than existing solutions.

Authors:Sarang Patil, Zeyong Zhang, Yiran Huang, Tengfei Ma, Mengjia Xu
Title: Hyperbolic Large Language Models
Abstract:
Large language models (LLMs) have achieved remarkable success and demonstrated superior performance across various tasks, including natural language processing (NLP), weather forecasting, biological protein folding, text generation, and solving mathematical problems. However, many real-world data exhibit highly non-Euclidean latent hierarchical anatomy, such as protein networks, transportation networks, financial networks, brain networks, and linguistic structures or syntactic trees in natural languages. Effectively learning intrinsic semantic entailment and hierarchical relationships from these raw, unstructured input data using LLMs remains an underexplored area. Due to its effectiveness in modeling tree-like hierarchical structures, hyperbolic geometry -- a non-Euclidean space -- has rapidly gained popularity as an expressive latent representation space for complex data modeling across domains such as graphs, images, languages, and multi-modal data. Here, we provide a comprehensive and contextual exposition of recent advancements in LLMs that leverage hyperbolic geometry as a representation space to enhance semantic representation learning and multi-scale reasoning. Specifically, the paper presents a taxonomy of the principal techniques of Hyperbolic LLMs (HypLLMs) in terms of four main categories: (1) hyperbolic LLMs through exp/log maps; (2) hyperbolic fine-tuned models; (3) fully hyperbolic LLMs, and (4) hyperbolic state-space models. We also explore crucial potential applications and outline future research directions. A repository of key papers, models, datasets, and code implementations is available at https://github.com/sarangp2402/Hyperbolic-LLM-Models/tree/main.
中文: 大型语言模型正越来越多地利用双曲几何来更好地捕捉复杂数据中的层次结构,其最新进展分为四大类技术,并展现出广阔的应用前景。
English: Large language models are increasingly leveraging hyperbolic geometry to better capture hierarchical structures in complex data, with recent advances categorized into four main techniques and promising applications.

Authors:Jiaqi Chen, Ji Shi, Cansu Sancaktar, Jonas Frey, Georg Martius
Title: Offline vs. Online Learning in Model-based RL: Lessons for Data Collection Strategies
Abstract:
Data collection is crucial for learning robust world models in model-based reinforcement learning. The most prevalent strategies are to actively collect trajectories by interacting with the environment during online training or training on offline datasets. At first glance, the nature of learning task-agnostic environment dynamics makes world models a good candidate for effective offline training. However, the effects of online vs. offline data on world models and thus on the resulting task performance have not been thoroughly studied in the literature. In this work, we investigate both paradigms in model-based settings, conducting experiments on 31 different environments. First, we showcase that online agents outperform their offline counterparts. We identify a key challenge behind performance degradation of offline agents: encountering Out-Of-Distribution states at test time. This issue arises because, without the self-correction mechanism in online agents, offline datasets with limited state space coverage induce a mismatch between the agent's imagination and real rollouts, compromising policy training. We demonstrate that this issue can be mitigated by allowing for additional online interactions in a fixed or adaptive schedule, restoring the performance of online training with limited interaction data. We also showcase that incorporating exploration data helps mitigate the performance degradation of offline agents. Based on our insights, we recommend adding exploration data when collecting large datasets, as current efforts predominantly focus on expert data alone.
Chinese: 在线智能体在基于模型的强化学习中优于离线智能体,因为后者面临分布外状态的挑战,但通过引入在线交互或探索数据可以有效缓解这一问题。
English: Online agents outperform offline ones in model-based reinforcement learning due to the latter's struggle with Out-Of-Distribution states, but this can be mitigated by incorporating online interactions or exploration data.

Authors:Liansheng Wang, Xinke Zhang, Chenhui Li, Dongjiao He, Yihan Pan, Jianjun Yi
Title: Super-LIO: A Robust and Efficient LiDAR-Inertial Odometry System with a Compact Mapping Strategy
Abstract:
LiDAR-Inertial Odometry (LIO) is a foundational technique for autonomous systems, yet its deployment on resource-constrained platforms remains challenging due to computational and memory limitations. We propose Super-LIO, a robust LIO system that demands both high performance and accuracy, ideal for applications such as aerial robots and mobile autonomous systems. At the core of Super-LIO is a compact octo-voxel-based map structure, termed OctVox, that limits each voxel to eight fused subvoxels, enabling strict point density control and incremental denoising during map updates. This design enables a simple yet efficient and accurate map structure, which can be easily integrated into existing LIO frameworks. Additionally, Super-LIO designs a heuristic-guided KNN strategy (HKNN) that accelerates the correspondence search by leveraging spatial locality, further reducing runtime overhead. We evaluated the proposed system using four publicly available datasets and several self-collected datasets, totaling more than 30 sequences. Extensive testing on both X86 and ARM platforms confirms that Super-LIO offers superior efficiency and robustness, while maintaining competitive accuracy. Super-LIO processes each frame approximately 73% faster than SOTA, while consuming less CPU resources. The system is fully open-source and plug-and-play compatible with a wide range of LiDAR sensors and platforms. The implementation is available at: https://github.com/Liansheng-Wang/Super-LIO.git
中文摘要:Super-LIO是一种高效的激光雷达惯性里程计系统,采用紧凑的OctVox地图结构和启发式KNN策略,在多个平台上处理速度比现有技术快73%,同时保持高精度。
English Summary: Super-LIO is an efficient LiDAR-Inertial Odometry system featuring a compact OctVox map structure and heuristic-guided KNN strategy, achieving 73% faster processing than state-of-the-art methods while maintaining high accuracy across multiple platforms.

Authors:Gašper Podobnik, Tomaž Vrtovec
Title: MeshMetrics: A Precise Implementation of Distance-Based Image Segmentation Metrics
Abstract:
The surge of research in image segmentation has yielded remarkable performance gains but also exposed a reproducibility crisis. A major contributor is performance evaluation, where both selection and implementation of metrics play critical roles. While recent efforts have improved the former, the reliability of metric implementation has received far less attention. Pitfalls in distance-based metric implementation can lead to considerable discrepancies between common open-source tools, for instance, exceeding 100 mm for the Hausdorff distance and 30%pt for the normalized surface distance for the same pair of segmentations. To address these pitfalls, we introduce MeshMetrics, a mesh-based framework that provides a more precise computation of distance-based metrics than conventional grid-based approaches. Through theoretical analysis and empirical validation, we demonstrate that MeshMetrics achieves higher accuracy and precision than established tools, and is substantially less affected by discretization artifacts, such as distance quantization. We release MeshMetrics as an open-source Python package, available at https://github.com/gasperpodobnik/MeshMetrics.
中文: 摘要指出图像分割领域因度量实现不可靠而面临可重复性危机,并提出了MeshMetrics这一基于网格的框架,相比传统基于网格的方法能更精确地计算基于距离的度量。
English: The abstract highlights a reproducibility crisis in image segmentation due to unreliable metric implementations and introduces MeshMetrics, a mesh-based framework that ensures more accurate computation of distance-based metrics than traditional grid-based methods.

Authors:Xiaomeng Zhu, Changwei Wang, Haozhe Wang, Xinyu Liu, Fangzhen Lin
Title: OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation
Abstract:
A scene graph is a structured represention of objects and their relationships in a scene. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications as intelligent surveillance and human-machine collaboration. Existing SGA approaches primarily leverage visual cues, often struggling to integrate valuable commonsense knowledge, thereby limiting long-term prediction robustness. To explicitly leverage such commonsense knowledge, we propose a new approach to better understand the objects, concepts, and relationships in a scene graph. Our approach decouples the SGA task in two steps: first a scene graph capturing model is used to convert a video clip into a sequence of scene graphs, then a pure text-based model is used to predict scene graphs in future frames. Our focus in this work is on the second step, and we call it Linguistic Scene Graph Anticipation (LSGA) and believes it should have independent interest beyond the use in SGA discussed here. For LSGA, we introduce an Object-Oriented Two-Staged Method (OOTSM) where an Large Language Model (LLM) first forecasts object appearances and disappearances before generating detailed human-object relations. We conduct extensive experiments to evaluate OOTSM in two settings. For LSGA, we evaluate our fine-tuned open-sourced LLMs against zero-shot APIs (i.e., GPT-4o, GPT-4o-mini, and DeepSeek-V3) on a benchmark constructed from Action Genome annotations. For SGA, we combine our OOTSM with STTran++ from, and our experiments demonstrate effective state-of-the-art performance: short-term mean-Recall (@10) increases by 3.4% while long-term mean-Recall (@50) improves dramatically by 21.9%. Code is available at https://github.com/ZhuXMMM/OOTSM.
中文: 本文提出了一种语言驱动的场景图预测方法,将任务解耦为视觉场景图提取和基于大语言模型的文本预测两个阶段,在长期预测方面实现了显著性能提升。
English: This paper introduces a linguistic approach to scene graph anticipation that decouples the task into visual scene graph extraction followed by text-based future prediction using large language models, achieving significant performance improvements particularly in long-term forecasting.

Authors:Yuxuan Hu, Jihao Liu, Ke Wang, Jinliang Zhen, Weikang Shi, Manyuan Zhang, Qi Dou, Rui Liu, Aojun Zhou, Hongsheng Li
Title: LM-Searcher: Cross-domain Neural Architecture Search with LLMs via Unified Numerical Encoding
Abstract:
Recent progress in Large Language Models (LLMs) has opened new avenues for solving complex optimization problems, including Neural Architecture Search (NAS). However, existing LLM-driven NAS approaches rely heavily on prompt engineering and domain-specific tuning, limiting their practicality and scalability across diverse tasks. In this work, we propose LM-Searcher, a novel framework that leverages LLMs for cross-domain neural architecture optimization without the need for extensive domain-specific adaptation. Central to our approach is NCode, a universal numerical string representation for neural architectures, which enables cross-domain architecture encoding and search. We also reformulate the NAS problem as a ranking task, training LLMs to select high-performing architectures from candidate pools using instruction-tuning samples derived from a novel pruning-based subspace sampling strategy. Our curated dataset, encompassing a wide range of architecture-performance pairs, encourages robust and transferable learning. Comprehensive experiments demonstrate that LM-Searcher achieves competitive performance in both in-domain (e.g., CNNs for image classification) and out-of-domain (e.g., LoRA configurations for segmentation and generation) tasks, establishing a new paradigm for flexible and generalizable LLM-based architecture search. The datasets and models will be released at https://github.com/Ashone3/LM-Searcher.
中文: LM-Searcher提出了一种新颖框架,利用大型语言模型进行跨领域神经架构优化,通过通用数值编码和将NAS重新定义为排序任务,无需大量领域特定调整即可在多种任务中实现优异性能。
English: LM-Searcher introduces a novel framework using Large Language Models for cross-domain neural architecture optimization, employing a universal numerical encoding and reformulating NAS as a ranking task to achieve competitive performance across diverse tasks without extensive domain-specific tuning.

Authors:Shay Dahary, Avi Edana, Alexander Apartsin, Yehudit Aperstein
Title: From Joy to Fear: A Benchmark of Emotion Estimation in Pop Song Lyrics
Abstract:
The emotional content of song lyrics plays a pivotal role in shaping listener experiences and influencing musical preferences. This paper investigates the task of multi-label emotional attribution of song lyrics by predicting six emotional intensity scores corresponding to six fundamental emotions. A manually labeled dataset is constructed using a mean opinion score (MOS) approach, which aggregates annotations from multiple human raters to ensure reliable ground-truth labels. Leveraging this dataset, we conduct a comprehensive evaluation of several publicly available large language models (LLMs) under zero-shot scenarios. Additionally, we fine-tune a BERT-based model specifically for predicting multi-label emotion scores. Experimental results reveal the relative strengths and limitations of zero-shot and fine-tuned models in capturing the nuanced emotional content of lyrics. Our findings highlight the potential of LLMs for emotion recognition in creative texts, providing insights into model selection strategies for emotion-based music information retrieval applications. The labeled dataset is available at https://github.com/LLM-HITCS25S/LyricsEmotionAttribution.
本研究评估大语言模型在预测歌词多标签情感强度方面的表现,通过比较零样本与微调方法推进基于情感的音乐检索应用。
This study evaluates large language models for predicting multi-label emotional intensity in song lyrics, comparing zero-shot and fine-tuned approaches to advance emotion-based music retrieval.

Authors:Jungin Park, Jiyoung Lee, Kwanghoon Sohn
Title: Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization
Abstract:
Video summarization aims to select keyframes that are visually diverse and can represent the whole story of a given video. Previous approaches have focused on global interlinkability between frames in a video by temporal modeling. However, fine-grained visual entities, such as objects, are also highly related to the main content of the video. Moreover, language-guided video summarization, which has recently been studied, requires a comprehensive linguistic understanding of complex real-world videos. To consider how all the objects are semantically related to each other, this paper regards video summarization as a language-guided spatiotemporal graph modeling problem. We present recursive spatiotemporal graph networks, called VideoGraph, which formulate the objects and frames as nodes of the spatial and temporal graphs, respectively. The nodes in each graph are connected and aggregated with graph edges, representing the semantic relationships between the nodes. To prevent the edges from being configured with visual similarity, we incorporate language queries derived from the video into the graph node representations, enabling them to contain semantic knowledge. In addition, we adopt a recursive strategy to refine initial graphs and correctly classify each frame node as a keyframe. In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and query-focused video summarization in both supervised and unsupervised manners. The code is available at https://github.com/park-jungin/videograph.
Chinese: 本文提出VideoGraph方法,将视频摘要视为语言引导的时空图建模问题,通过递归图网络整合对象与帧之间的语义关系,在多项基准测试中实现了最先进的性能。
English: This paper introduces VideoGraph, a novel approach that treats video summarization as a language-guided spatiotemporal graph modeling problem, achieving state-of-the-art performance by incorporating semantic relationships between objects and frames through recursive graph networks.

Authors:Changtao Miao, Yi Zhang, Man Luo, Weiwei Feng, Kaiyuan Zheng, Qi Chu, Tao Gong, Jianshu Li, Yunfeng Diao, Wei Zhou, Joey Tianyi Zhou, Xiaoshuai Hao
Title: MFFI: Multi-Dimensional Face Forgery Image Dataset for Real-World Scenarios
Abstract:
Rapid advances in Artificial Intelligence Generated Content (AIGC) have enabled increasingly sophisticated face forgeries, posing a significant threat to social security. However, current Deepfake detection methods are limited by constraints in existing datasets, which lack the diversity necessary in real-world scenarios. Specifically, these data sets fall short in four key areas: unknown of advanced forgery techniques, variability of facial scenes, richness of real data, and degradation of real-world propagation. To address these challenges, we propose the Multi-dimensional Face Forgery Image (\textbf{MFFI}) dataset, tailored for real-world scenarios. MFFI enhances realism based on four strategic dimensions: 1) Wider Forgery Methods; 2) Varied Facial Scenes; 3) Diversified Authentic Data; 4) Multi-level Degradation Operations. MFFI integrates $50$ different forgery methods and contains $1024K$ image samples. Benchmark evaluations show that MFFI outperforms existing public datasets in terms of scene complexity, cross-domain generalization capability, and detection difficulty gradients. These results validate the technical advance and practical utility of MFFI in simulating real-world conditions. The dataset and additional details are publicly available at {https://github.com/inclusionConf/MFFI}.
中文:MFFI数据集通过融合多种伪造方法、多样化场景、真实数据及现实退化处理,弥补了现有Deepfake检测数据集的不足,显著提升了真实场景下的检测性能。
English: The MFFI dataset addresses limitations in current Deepfake detection by incorporating diverse forgery methods, varied scenes, authentic data, and real-world degradation, enhancing detection capabilities for real-world scenarios.

Authors:Zixi Li
Title: TreeGPT: Pure TreeFFN Encoder-Decoder Architecture for Structured Reasoning Without Attention Mechanisms
Abstract:
We present TreeGPT, an attention-free neural architecture that explores the potential of pure TreeFFN encoder-decoder design for structured reasoning tasks. Unlike traditional transformer approaches that rely on attention mechanisms, TreeGPT employs bidirectional TreeFFN components that process sequences through adjacent connections in parallel, aiming to achieve computational efficiency while maintaining reasoning capabilities. Our approach centers on a TreeFFN Encoder-Decoder mechanism: $$\text{Encoder TreeFFN (L} \rightarrow \text{R)} + \text{Decoder TreeFFN (R} \leftarrow \text{L)} \rightarrow \text{Parallel Processing}$$ where the encoder processes left-to-right dependencies while the decoder handles right-to-left patterns, both using simple neighbor-to-neighbor connections. This design eliminates attention computation while maintaining sequence modeling capabilities. We evaluate our approach on the ARC Prize 2025 dataset, where TreeGPT achieves 99\% validation accuracy using 3.16M parameters. The model converges within 1500 training steps and demonstrates 100\% token-level accuracy on selected evaluation samples. Our preliminary results suggest that for certain structured reasoning tasks, specialized TreeFFN architectures may offer advantages over attention-based approaches. While these findings are encouraging, we acknowledge that further investigation across diverse tasks and datasets would be valuable to establish the broader applicability of attention-free designs.
中文摘要:TreeGPT是一种无需注意力机制的神经架构,采用双向TreeFFN编码器-解码器组件进行并行序列处理,在ARC Prize 2025数据集上以316万参数实现99%验证准确率,在保持推理能力的同时展现出高效的计算性能。
English Summary: TreeGPT is an attention-free neural architecture using bidirectional TreeFFN encoder-decoder components for parallel sequence processing, achieving 99% validation accuracy on ARC Prize 2025 with efficient computational performance while maintaining reasoning capabilities.

Authors:Andrej Orsula, Matthieu Geist, Miguel Olivares-Mendez, Carol Martinez
Title: Learning Tool-Aware Adaptive Compliant Control for Autonomous Regolith Excavation
Abstract:
Autonomous regolith excavation is a cornerstone of in-situ resource utilization for a sustained human presence beyond Earth. However, this task is fundamentally hindered by the complex interaction dynamics of granular media and the operational need for robots to use diverse tools. To address these challenges, this work introduces a framework where a model-based reinforcement learning agent learns within a parallelized simulation. This environment leverages high-fidelity particle physics and procedural generation to create a vast distribution of both lunar terrains and excavation tool geometries. To master this diversity, the agent learns an adaptive interaction strategy by dynamically modulating its own stiffness and damping at each control step through operational space control. Our experiments demonstrate that training with a procedural distribution of tools is critical for generalization and enables the development of sophisticated tool-aware behavior. Furthermore, we show that augmenting the agent with visual feedback significantly improves task success. These results represent a validated methodology for developing the robust and versatile autonomous systems required for the foundational tasks of future space missions.
中文摘要:本研究开发了一种基于模型的强化学习框架,通过高精度粒子仿真使自主机器人能够掌握跨多种月球地形和挖掘工具的适应性作业策略,证明程序化工具训练与视觉反馈可显著提升未来太空任务中系统的泛化能力和作业成功率。
English Summary: This study develops a model-based reinforcement learning framework using high-fidelity particle simulations to enable autonomous robots to master adaptive excavation strategies across diverse lunar terrains and tool geometries, demonstrating that procedural tool training and visual feedback significantly enhance generalization and task success for future space missions.

Authors:Jie Fu, Hong Yuan, Zhili Chen, Wendy Hui Wang
Title: Safeguarding Graph Neural Networks against Topology Inference Attacks
Abstract:
Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, their widespread adoption has raised serious privacy concerns. While prior research has primarily focused on edge-level privacy, a critical yet underexplored threat lies in topology privacy - the confidentiality of the graph's overall structure. In this work, we present a comprehensive study on topology privacy risks in GNNs, revealing their vulnerability to graph-level inference attacks. To this end, we propose a suite of Topology Inference Attacks (TIAs) that can reconstruct the structure of a target training graph using only black-box access to a GNN model. Our findings show that GNNs are highly susceptible to these attacks, and that existing edge-level differential privacy mechanisms are insufficient as they either fail to mitigate the risk or severely compromise model accuracy. To address this challenge, we introduce Private Graph Reconstruction (PGR), a novel defense framework designed to protect topology privacy while maintaining model accuracy. PGR is formulated as a bi-level optimization problem, where a synthetic training graph is iteratively generated using meta-gradients, and the GNN model is concurrently updated based on the evolving graph. Extensive experiments demonstrate that PGR significantly reduces topology leakage with minimal impact on model accuracy. Our code is available at https://github.com/JeffffffFu/PGR.
中文: 本研究通过新型拓扑推理攻击揭示了图神经网络存在的严重拓扑隐私漏洞,并提出了私有图重构防御框架,该方案能在保护图结构机密性的同时有效维持模型精度。
English: This study exposes significant topology privacy vulnerabilities in Graph Neural Networks (GNNs) through novel Topology Inference Attacks and introduces Private Graph Reconstruction, a defense framework that effectively protects graph structure confidentiality while preserving model accuracy.

Authors:Ashen Rodrigo, Isuru Munasinghe, Asanka Perera
Title: Vision-Based Object Detection for UAV Solar Panel Inspection Using an Enhanced Defects Dataset
Abstract:
Timely and accurate detection of defects and contaminants in solar panels is critical for maintaining the efficiency and reliability of photovoltaic systems. This study presents a comprehensive evaluation of five state-of-the-art object detection models: YOLOv3, Faster R-CNN, RetinaNet, EfficientDet, and Swin Transformer, for identifying physical and electrical defects as well as surface contaminants such as dust, dirt, and bird droppings on solar panels. A custom dataset, annotated in the COCO format and specifically designed for solar panel defect and contamination detection, was developed alongside a user interface to train and evaluate the models. The performance of each model is assessed and compared based on mean Average Precision (mAP), precision, recall, and inference speed. The results demonstrate the trade-offs between detection accuracy and computational efficiency, highlighting the relative strengths and limitations of each model. These findings provide valuable guidance for selecting appropriate detection approaches in practical solar panel monitoring and maintenance scenarios. The dataset will be publicly available at https://github.com/IsuruMunasinghe98/solar-panel-inspection-dataset.
Chinese: 本研究评估了五种先进目标检测模型在识别太阳能电池板缺陷与污染物方面的性能,通过对比精度与效率为实际监测应用提供指导。
English: This study evaluates five advanced object detection models for identifying defects and contaminants on solar panels, comparing their accuracy and efficiency to guide practical monitoring applications.

Authors:Gaspard Beaudouin, Minghan Li, Jaeyeon Kim, Sung-Hoon Yoon, Mengyu Wang
Title: Delta Velocity Rectified Flow for Text-to-Image Editing
Abstract:
We propose Delta Velocity Rectified Flow (DVRF), a novel inversion-free, path-aware editing framework within rectified flow models for text-to-image editing. DVRF is a distillation-based method that explicitly models the discrepancy between the source and target velocity fields in order to mitigate over-smoothing artifacts rampant in prior distillation sampling approaches. We further introduce a time-dependent shift term to push noisy latents closer to the target trajectory, enhancing the alignment with the target distribution. We theoretically demonstrate that when this shift is disabled, DVRF reduces to Delta Denoising Score, thereby bridging score-based diffusion optimization and velocity-based rectified-flow optimization. Moreover, when the shift term follows a linear schedule under rectified-flow dynamics, DVRF generalizes the Inversion-free method FlowEdit and provides a principled theoretical interpretation for it. Experimental results indicate that DVRF achieves superior editing quality, fidelity, and controllability while requiring no architectural modifications, making it efficient and broadly applicable to text-to-image editing tasks. Code is available at https://github.com/Harvard-AI-and-Robotics-Lab/DeltaVelocityRectifiedFlow.
中文: 我们提出了Delta Velocity Rectified Flow (DVRF),这是一种无反转的编辑框架,通过建模速度场差异并引入时间相关偏移来提升文本到图像编辑的质量和对齐度,无需修改模型架构。
English: We introduce Delta Velocity Rectified Flow (DVRF), an inversion-free framework that models velocity field discrepancies and incorporates a time-dependent shift to enhance text-to-image editing quality and alignment without architectural changes.

Authors:Matteo Poggi, Fabio Tosi
Title: FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases
Abstract:
We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances on the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact, yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower compared to most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with a relative improvement of 10 and 15% over the previous state-of-the-art SEA-RAFT, as well as on Spring and LayeredFlow datasets.
Chinese: FlowSeek是一种高效的光流框架,仅需单个消费级GPU即可完成训练,在多个数据集上实现卓越的跨数据集泛化能力,性能比先前最优方法提升10-15%。
English: FlowSeek is a highly efficient optical flow framework that achieves superior cross-dataset generalization with minimal hardware requirements, training on a single consumer-grade GPU while outperforming previous state-of-the-art methods by 10-15%.

Authors:Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, Tong He
Title: WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
Abstract:
We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
中文: WinT3R是一种前馈重建模型,通过滑动窗口机制和紧凑相机表示,在在线重建质量、相机姿态估计和速度方面均达到领先水平。
English: WinT3R is a feed-forward reconstruction model that achieves state-of-the-art online reconstruction quality, camera pose estimation, and speed through a sliding window mechanism and compact camera representation.

Authors:Henri Doerks, Paul Häusner, Daniel Hernández Escobar, Jens Sjölund
Title: Learning to accelerate distributed ADMM using graph neural networks
Abstract:
Distributed optimization is fundamental in large-scale machine learning and control applications. Among existing methods, the Alternating Direction Method of Multipliers (ADMM) has gained popularity due to its strong convergence guarantees and suitability for decentralized computation. However, ADMM often suffers from slow convergence and sensitivity to hyperparameter choices. In this work, we show that distributed ADMM iterations can be naturally represented within the message-passing framework of graph neural networks (GNNs). Building on this connection, we propose to learn adaptive step sizes and communication weights by a graph neural network that predicts the hyperparameters based on the iterates. By unrolling ADMM for a fixed number of iterations, we train the network parameters end-to-end to minimize the final iterates error for a given problem class, while preserving the algorithm's convergence properties. Numerical experiments demonstrate that our learned variant consistently improves convergence speed and solution quality compared to standard ADMM. The code is available at https://github.com/paulhausner/learning-distributed-admm.
中文: 本文通过将分布式ADMM与图神经网络结合,学习自适应超参数,在保持理论收敛性的同时显著提升了收敛速度和解的质量。
English: This paper connects distributed ADMM with graph neural networks to learn adaptive hyperparameters, enhancing convergence speed and solution quality while maintaining theoretical guarantees.

Authors:Zhen Qin, Xuyang Shen, Yiran Zhong
Title: Elucidating the Design Space of Decay in Linear Attention
Abstract:
This paper presents a comprehensive investigation into the decay mechanisms inherent in linear complexity sequence models. We systematically delineate the design space of decay mechanisms across four pivotal dimensions: parameterization strategy, which refers to the computational methodology for decay; parameter sharing, which involves the utilization of supplementary parameters for decay computation; decay granularity, comparing scalar versus vector-based decay; and compatibility with relative positional encoding methods, such as Rotary Position Embedding (RoPE). Through an extensive series of experiments conducted on diverse language modeling tasks, we uncovered several critical insights. Firstly, the design of the parameterization strategy for decay requires meticulous consideration. Our findings indicate that effective configurations are typically confined to a specific range of parameters. Secondly, parameter sharing cannot be used arbitrarily, as it may cause decay values to be too large or too small, thereby significantly impacting performance. Thirdly, under identical parameterization strategies, scalar decay generally underperforms compared to its vector-based counterpart. However, in certain scenarios with alternative parameterization strategies, scalar decay may unexpectedly surpass vector decay in efficacy. Lastly, our analysis reveals that RoPE, a commonly employed relative positional encoding method, typically fails to provide tangible benefits to the majority of linear attention mechanisms.
本研究系统探讨了线性复杂度序列模型在参数化策略、共享机制、衰减粒度和RoPE兼容性四个关键维度的衰减机制设计,揭示了有效配置参数范围严格受限、标量衰减通常弱于向量衰减、以及RoPE对多数线性注意力机制增益有限的重要发现。
This study systematically explores decay mechanisms in linear complexity sequence models across four key design dimensions—parameterization, sharing, granularity, and RoPE compatibility—revealing that effective configurations are narrowly constrained, scalar decay generally underperforms vector decay, and RoPE offers limited benefits to most linear attention mechanisms.

Authors:Zijian Wang, Wei Tong, Tingxuan Han, Haoyu Chen, Tianling Zhang, Yunlong Mao, Sheng Zhong
Title: On Evaluating the Poisoning Robustness of Federated Learning under Local Differential Privacy
Abstract:
Federated learning (FL) combined with local differential privacy (LDP) enables privacy-preserving model training across decentralized data sources. However, the decentralized data-management paradigm leaves LDPFL vulnerable to participants with malicious intent. The robustness of LDPFL protocols, particularly against model poisoning attacks (MPA), where adversaries inject malicious updates to disrupt global model convergence, remains insufficiently studied. In this paper, we propose a novel and extensible model poisoning attack framework tailored for LDPFL settings. Our approach is driven by the objective of maximizing the global training loss while adhering to local privacy constraints. To counter robust aggregation mechanisms such as Multi-Krum and trimmed mean, we develop adaptive attacks that embed carefully crafted constraints into a reverse training process, enabling evasion of these defenses. We evaluate our framework across three representative LDPFL protocols, three benchmark datasets, and two types of deep neural networks. Additionally, we investigate the influence of data heterogeneity and privacy budgets on attack effectiveness. Experimental results demonstrate that our adaptive attacks can significantly degrade the performance of the global model, revealing critical vulnerabilities and highlighting the need for more robust LDPFL defense strategies against MPA. Our code is available at https://github.com/ZiJW/LDPFL-Attack
中文: 本文针对本地差分隐私联邦学习系统提出了一种新型自适应模型投毒攻击框架,通过逆向训练嵌入约束条件来规避鲁棒聚合防御机制,在多种协议和数据集上显著降低全局模型性能,揭示了当前防御策略的关键脆弱性。
English: This paper introduces a novel adaptive model poisoning attack framework for LDPFL systems that bypasses robust aggregation defenses by embedding constraints through reverse training, significantly degrading global model performance across multiple protocols and datasets while highlighting critical vulnerabilities in current defense strategies.

Authors:Mohammad Saeid, Amir Salarpour, Pedram MohajerAnsari
Title: Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet
Abstract:
The classification of 3D point clouds is crucial for applications such as autonomous driving, robotics, and augmented reality. However, the commonly used ModelNet40 dataset suffers from limitations such as inconsistent labeling, 2D data, size mismatches, and inadequate class differentiation, which hinder model performance. This paper introduces ModelNet-R, a meticulously refined version of ModelNet40 designed to address these issues and serve as a more reliable benchmark. Additionally, this paper proposes Point-SkipNet, a lightweight graph-based neural network that leverages efficient sampling, neighborhood grouping, and skip connections to achieve high classification accuracy with reduced computational overhead. Extensive experiments demonstrate that models trained in ModelNet-R exhibit significant performance improvements. Notably, Point-SkipNet achieves state-of-the-art accuracy on ModelNet-R with a substantially lower parameter count compared to contemporary models. This research highlights the crucial role of dataset quality in optimizing model efficiency for 3D point cloud classification. For more details, see the code at: https://github.com/m-saeid/ModeNetR_PointSkipNet.
中文: 本文提出了改进的3D点云数据集ModelNet-R以解决ModelNet40的缺陷,并设计了轻量级神经网络Point-SkipNet,该网络以更少参数实现最优分类精度,凸显了数据集质量对模型效能的关键作用。
English: This paper introduces ModelNet-R, an improved 3D point cloud dataset addressing ModelNet40's limitations, and proposes Point-SkipNet, a lightweight neural network that achieves top accuracy with fewer parameters, emphasizing dataset quality's role in model efficiency.

Authors:Julia Dietlmeier, Oluwabukola Grace Adegboro, Vayangi Ganepola, Claudia Mazo, Noel E. O'Connor
Title: VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation
Abstract:
Vision-language models and their adaptations to image segmentation tasks present enormous potential for producing highly accurate and interpretable results. However, implementations based on CLIP and BiomedCLIP are still lagging behind more sophisticated architectures such as CRIS. In this work, instead of focusing on text prompt engineering as is the norm, we attempt to narrow this gap by showing how to ensemble vision-language segmentation models (VLSMs) with a low-complexity CNN. By doing so, we achieve a significant Dice score improvement of 6.3% on the BKAI polyp dataset using the ensembled BiomedCLIPSeg, while other datasets exhibit gains ranging from 1% to 6%. Furthermore, we provide initial results on additional four radiology and non-radiology datasets. We conclude that ensembling works differently across these datasets (from outperforming to underperforming the CRIS model), indicating a topic for future investigation by the community. The code is available at https://github.com/juliadietlmeier/VLSM-Ensemble.
中文: 本研究通过将视觉语言分割模型与低复杂度CNN集成,显著提升了医学图像分割性能,在部分数据集上Dice分数最高提升6.3%,但不同数据集表现存在差异,为未来研究提供了新方向。
English: This research demonstrates that ensembling vision-language segmentation models with a low-complexity CNN significantly improves performance, achieving up to a 6.3% Dice score increase on medical datasets, though results vary compared to sophisticated architectures like CRIS.

Authors:Yanzhi Tian, Zeming Liu, Zhengyang Liu, Chong Feng, Xin Li, Heyan Huang, Yuhang Guo
Title: PRIM: Towards Practical In-Image Multilingual Machine Translation
Abstract:
In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research of end-to-end IIMT mainly conducts on synthetic data, with simple background, single font, fixed text position, and bilingual translation, which can not fully reflect real world, causing a significant gap between the research and practical conditions. To facilitate research of IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). In order to convince the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex background, various fonts, diverse text positions, and supports multilingual translation directions. We propose an end-to-end model VisTrans to handle the challenge of practical conditions in PRIM, which processes visual text and background information in the image separately, ensuring the capability of multilingual translation while improving the visual quality. Experimental results indicate the VisTrans achieves a better translation quality and visual effect compared to other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.
中文: 本研究提出了PRIM数据集和端到端的VisTrans模型,以解决实际场景中的图像内多语言机器翻译问题,相比其他模型在翻译质量和视觉效果上表现更优。
English: This study introduces the PRIM dataset and an end-to-end model called VisTrans to address practical in-image multilingual machine translation challenges, achieving superior translation quality and visual effects compared to existing models.

Authors:Rafael Bischof, Michal Piovarči, Michael A. Kraus, Siddhartha Mishra, Bernd Bickel
Title: HyPINO: Multi-Physics Neural Operators via HyperPINNs and the Method of Manufactured Solutions
Abstract:
We present HyPINO, a multi-physics neural operator designed for zero-shot generalization across a broad class of parametric PDEs without requiring task-specific fine-tuning. Our approach combines a Swin Transformer-based hypernetwork with mixed supervision: (i) labeled data from analytical solutions generated via the Method of Manufactured Solutions (MMS), and (ii) unlabeled samples optimized using physics-informed objectives. The model maps PDE parametrizations to target Physics-Informed Neural Networks (PINNs) and can handle linear elliptic, hyperbolic, and parabolic equations in two dimensions with varying source terms, geometries, and mixed Dirichlet/Neumann boundary conditions, including interior boundaries. HyPINO achieves strong zero-shot accuracy on seven benchmark problems from PINN literature, outperforming U-Nets, Poseidon, and Physics-Informed Neural Operators (PINO). Further, we introduce an iterative refinement procedure that compares the physics of the generated PINN to the requested PDE and uses the discrepancy to generate a "delta" PINN. Summing their contributions and repeating this process forms an ensemble whose combined solution progressively reduces the error on six benchmarks and achieves over 100x gain in average $L_2$ loss in the best case, while retaining forward-only inference. Additionally, we evaluate the fine-tuning behavior of PINNs initialized by HyPINO and show that they converge faster and to lower final error than both randomly initialized and Reptile-meta-learned PINNs on five benchmarks, performing on par on the remaining two. Our results highlight the potential of this scalable approach as a foundation for extending neural operators toward solving increasingly complex, nonlinear, and high-dimensional PDE problems. The code and model weights are publicly available at https://github.com/rbischof/hypino.
中文: HyPINO 是一种多物理场神经算子,通过基于 Swin Transformer 的超网络与混合监督实现参数化偏微分方程的零样本泛化,其性能优于现有方法,并为复杂偏微分方程问题提供了可扩展的解决方案。
English: HyPINO is a multi-physics neural operator that achieves zero-shot generalization across parametric PDEs through a Swin Transformer-based hypernetwork with mixed supervision, outperforming existing methods and enabling scalable solutions for complex PDE problems.

Authors:Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner
Title: Robust Experts: the Effect of Adversarial Training on CNNs with Sparse Mixture-of-Experts Layers
Abstract:
Robustifying convolutional neural networks (CNNs) against adversarial attacks remains challenging and often requires resource-intensive countermeasures. We explore the use of sparse mixture-of-experts (MoE) layers to improve robustness by replacing selected residual blocks or convolutional layers, thereby increasing model capacity without additional inference cost. On ResNet architectures trained on CIFAR-100, we find that inserting a single MoE layer in the deeper stages leads to consistent improvements in robustness under PGD and AutoPGD attacks when combined with adversarial training. Furthermore, we discover that when switch loss is used for balancing, it causes routing to collapse onto a small set of overused experts, thereby concentrating adversarial training on these paths and inadvertently making them more robust. As a result, some individual experts outperform the gated MoE model in robustness, suggesting that robust subpaths emerge through specialization. Our code is available at https://github.com/KASTEL-MobilityLab/robust-sparse-moes.
中文: 该研究表明,在卷积神经网络中引入稀疏专家混合层能通过增加模型容量而不提升推理成本来增强对抗攻击的鲁棒性,同时发现当路由集中于少数过度使用的专家时,会形成专门化的鲁棒子路径。
English: This study demonstrates that integrating sparse mixture-of-experts layers into CNNs enhances robustness against adversarial attacks by increasing model capacity without extra inference costs, while also revealing that specialized robust subpaths emerge when routing collapses onto overused experts.

Authors:Luca Müller, Hassan Ali, Philipp Allgeuer, Lukáš Gajdošech, Stefan Wermter
Title: Pointing-Guided Target Estimation via Transformer-Based Attention
Abstract:
Deictic gestures, like pointing, are a fundamental form of non-verbal communication, enabling humans to direct attention to specific objects or locations. This capability is essential in Human-Robot Interaction (HRI), where robots should be able to predict human intent and anticipate appropriate responses. In this work, we propose the Multi-Modality Inter-TransFormer (MM-ITF), a modular architecture to predict objects in a controlled tabletop scenario with the NICOL robot, where humans indicate targets through natural pointing gestures. Leveraging inter-modality attention, MM-ITF maps 2D pointing gestures to object locations, assigns a likelihood score to each, and identifies the most likely target. Our results demonstrate that the method can accurately predict the intended object using monocular RGB data, thus enabling intuitive and accessible human-robot collaboration. To evaluate the performance, we introduce a patch confusion matrix, providing insights into the model's predictions across candidate object locations. Code available at: https://github.com/lucamuellercode/MMITF.
Chinese Summary: 本研究提出MM-ITF模块化架构,通过跨模态注意力机制将二维指向手势映射至目标物体位置,利用单目RGB数据实现精准意图识别,并引入区块混淆矩阵评估模型性能,为人机协作提供直观交互方案。
English Summary: The study introduces MM-ITF, a modular architecture that accurately predicts target objects from human pointing gestures using monocular RGB data, enhancing intuitive human-robot collaboration through inter-modality attention and a novel evaluation metric.

Authors:Hulin Li, Qiliang Ren, Jun Li, Hanbing Wei, Zheng Liu, Linfang Fan
Title: A biologically inspired separable learning vision model for real-time traffic object perception in Dark
Abstract:
Fast and accurate object perception in low-light traffic scenes has attracted increasing attention. However, due to severe illumination degradation and the lack of reliable visual cues, existing perception models and methods struggle to quickly adapt to and accurately predict in low-light environments. Moreover, there is the absence of available large-scale benchmark specifically focused on low-light traffic scenes. To bridge this gap, we introduce a physically grounded illumination degradation method tailored to real-world low-light settings and construct Dark-traffic, the largest densely annotated dataset to date for low-light traffic scenes, supporting object detection, instance segmentation, and optical flow estimation. We further propose the Separable Learning Vision Model (SLVM), a biologically inspired framework designed to enhance perception under adverse lighting. SLVM integrates four key components: a light-adaptive pupillary mechanism for illumination-sensitive feature extraction, a feature-level separable learning strategy for efficient representation, task-specific decoupled branches for multi-task separable learning, and a spatial misalignment-aware fusion module for precise multi-feature alignment. Extensive experiments demonstrate that SLVM achieves state-of-the-art performance with reduced computational overhead. Notably, it outperforms RT-DETR by 11.2 percentage points in detection, YOLOv12 by 6.1 percentage points in instance segmentation, and reduces endpoint error (EPE) of baseline by 12.37% on Dark-traffic. On the LIS benchmark, the end-to-end trained SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points across key metrics, and exceeds Mask RCNN (with light enhancement) by 3.1 percentage points. The Dark-traffic dataset and complete code is released at https://github.com/alanli1997/slvm.
中文: 本文提出了针对低光照交通场景的最大标注数据集Dark-traffic,并设计了仿生可分离学习视觉模型SLVM,该模型以更低计算成本在目标检测、实例分割和光流估计任务中实现了最优性能。
English: This paper introduces Dark-traffic, the largest annotated dataset for low-light traffic scenes, and proposes the Separable Learning Vision Model (SLVM), a biologically inspired framework that achieves state-of-the-art performance in object detection, instance segmentation, and optical flow estimation with reduced computational costs.

Authors:Jie Chen, Jinhao Jiang, Yingqian Min, Zican Dong, Shijie Wang, Wayne Xin Zhao, Ji-Rong Wen
Title: Sticker-TTS: Learn to Utilize Historical Experience with a Sticker-driven Test-Time Scaling Framework
Abstract:
Large reasoning models (LRMs) have exhibited strong performance on complex reasoning tasks, with further gains achievable through increased computational budgets at inference. However, current test-time scaling methods predominantly rely on redundant sampling, ignoring the historical experience utilization, thereby limiting computational efficiency. To overcome this limitation, we propose Sticker-TTS, a novel test-time scaling framework that coordinates three collaborative LRMs to iteratively explore and refine solutions guided by historical attempts. At the core of our framework are distilled key conditions-termed stickers-which drive the extraction, refinement, and reuse of critical information across multiple rounds of reasoning. To further enhance the efficiency and performance of our framework, we introduce a two-stage optimization strategy that combines imitation learning with self-improvement, enabling progressive refinement. Extensive evaluations on three challenging mathematical reasoning benchmarks, including AIME-24, AIME-25, and OlymMATH, demonstrate that Sticker-TTS consistently surpasses strong baselines, including self-consistency and advanced reinforcement learning approaches, under comparable inference budgets. These results highlight the effectiveness of sticker-guided historical experience utilization. Our code and data are available at https://github.com/RUCAIBox/Sticker-TTS.
中文摘要:Sticker-TTS是一种新颖的测试时扩展框架,通过协调多个大型推理模型利用历史经验迭代优化解决方案,在相同计算预算下于数学推理基准测试中显著优于现有方法。
English Summary: Sticker-TTS is a novel test-time scaling framework that enhances computational efficiency by coordinating multiple large reasoning models to iteratively refine solutions using distilled historical information, outperforming existing methods on mathematical reasoning benchmarks under comparable inference budgets.

Authors:Midhun Shyam, Jim Basilakis, Kieran Luken, Steven Thomas, John Crozier, Paul M. Middleton, X. Rosalind Wang
Title: Classification of kinetic-related injury in hospital triage data using NLP
Abstract:
Triage notes, created at the start of a patient's hospital visit, contain a wealth of information that can help medical staff and researchers understand Emergency Department patient epidemiology and the degree of time-dependent illness or injury. Unfortunately, applying modern Natural Language Processing and Machine Learning techniques to analyse triage data faces some challenges: Firstly, hospital data contains highly sensitive information that is subject to privacy regulation thus need to be analysed on site; Secondly, most hospitals and medical facilities lack the necessary hardware to fine-tune a Large Language Model (LLM), much less training one from scratch; Lastly, to identify the records of interest, expert inputs are needed to manually label the datasets, which can be time-consuming and costly. We present in this paper a pipeline that enables the classification of triage data using LLM and limited compute resources. We first fine-tuned a pre-trained LLM with a classifier using a small (2k) open sourced dataset on a GPU; and then further fine-tuned the model with a hospital specific dataset of 1000 samples on a CPU. We demonstrated that by carefully curating the datasets and leveraging existing models and open sourced data, we can successfully classify triage data with limited compute resources.
Chinese: 本文提出了一种流程,通过使用小规模开源和医院特定数据集对预训练大语言模型进行微调,能够在有限计算资源下成功实现分诊数据的分类。
English: This paper introduces a pipeline that enables triage data classification using large language models with limited computational resources by fine-tuning pre-trained models on small datasets, both open-sourced and hospital-specific.

Authors:Hongyi Jing, Jiafu Chen, Chen Rao, Ziqiang Dang, Jiajie Teng, Tianyi Chu, Juncheng Mo, Shuo Fang, Huaizhong Lin, Rui Lv, Chenguang Ma, Lei Zhao
Title: SparkUI-Parser: Enhancing GUI Perception with Robust Grounding and Parsing
Abstract:
The existing Multimodal Large Language Models (MLLMs) for GUI perception have made great progress. However, the following challenges still exist in prior methods: 1) They model discrete coordinates based on text autoregressive mechanism, which results in lower grounding accuracy and slower inference speed. 2) They can only locate predefined sets of elements and are not capable of parsing the entire interface, which hampers the broad application and support for downstream tasks. To address the above issues, we propose SparkUI-Parser, a novel end-to-end framework where higher localization precision and fine-grained parsing capability of the entire interface are simultaneously achieved. Specifically, instead of using probability-based discrete modeling, we perform continuous modeling of coordinates based on a pre-trained Multimodal Large Language Model (MLLM) with an additional token router and coordinate decoder. This effectively mitigates the limitations inherent in the discrete output characteristics and the token-by-token generation process of MLLMs, consequently boosting both the accuracy and the inference speed. To further enhance robustness, a rejection mechanism based on a modified Hungarian matching algorithm is introduced, which empowers the model to identify and reject non-existent elements, thereby reducing false positives. Moreover, we present ScreenParse, a rigorously constructed benchmark to systematically assess structural perception capabilities of GUI models across diverse scenarios. Extensive experiments demonstrate that our approach consistently outperforms SOTA methods on ScreenSpot, ScreenSpot-v2, CAGUI-Grounding and ScreenParse benchmarks. The resources are available at https://github.com/antgroup/SparkUI-Parser.
中文摘要:现有GUI感知多模态大语言模型因离散坐标建模和有限元素检测存在精度与速度问题,SparkUI-Parser通过连续坐标建模和增强解析能力,在多个基准测试中实现更优性能。
English Summary: Existing multimodal large language models for GUI perception face challenges in accuracy and speed due to discrete coordinate modeling and limited element detection, which SparkUI-Parser addresses through continuous coordinate modeling and enhanced parsing capabilities to achieve superior performance across multiple benchmarks.

Authors:Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, Jiajun Zhang
Title: ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
Abstract:
Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference reward in reinforcement learning (RL). (2) Focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address this issue, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
Chinese: ACE-RL框架通过自适应约束量化长文本生成质量并利用强化学习,有效解决了现有方法的局限,在基准测试中显著优于现有技术,甚至超越了GPT-4o等专有系统。
English: The ACE-RL framework addresses limitations in long-form generation by using adaptive constraints to quantify response quality and reinforcement learning, significantly outperforming existing methods and even surpassing proprietary systems like GPT-4o in benchmarks.

Authors:Xinkui Lin, Yongxiu Xu, Minghao Tang, Shilong Zhang, Hongbo Xu, Hao Xu, Yubin Wang
Title: REMOTE: A Unified Multimodal Relation Extraction Framework with Multilevel Optimal Transport and Mixture-of-Experts
Abstract:
Multimodal relation extraction (MRE) is a crucial task in the fields of Knowledge Graph and Multimedia, playing a pivotal role in multimodal knowledge graph construction. However, existing methods are typically limited to extracting a single type of relational triplet, which restricts their ability to extract triplets beyond the specified types. Directly combining these methods fails to capture dynamic cross-modal interactions and introduces significant computational redundancy. Therefore, we propose a novel \textit{unified multimodal Relation Extraction framework with Multilevel Optimal Transport and mixture-of-Experts}, termed REMOTE, which can simultaneously extract intra-modal and inter-modal relations between textual entities and visual objects. To dynamically select optimal interaction features for different types of relational triplets, we introduce mixture-of-experts mechanism, ensuring the most relevant modality information is utilized. Additionally, considering that the inherent property of multilayer sequential encoding in existing encoders often leads to the loss of low-level information, we adopt a multilevel optimal transport fusion module to preserve low-level features while maintaining multilayer encoding, yielding more expressive representations. Correspondingly, we also create a Unified Multimodal Relation Extraction (UMRE) dataset to evaluate the effectiveness of our framework, encompassing diverse cases where the head and tail entities can originate from either text or image. Extensive experiments show that REMOTE effectively extracts various types of relational triplets and achieves state-of-the-art performanc on almost all metrics across two other public MRE datasets. We release our resources at https://github.com/Nikol-coder/REMOTE.
中文: REMOTE框架采用多层次最优传输和专家混合机制的统一多模态关系提取方法,能动态捕捉跨模态交互并保留底层特征,在多个数据集上实现了最先进的性能。
English: The REMOTE framework introduces a unified multimodal relation extraction approach using multilevel optimal transport and mixture-of-experts to dynamically capture cross-modal interactions and preserve low-level features, achieving state-of-the-art performance across multiple datasets.

Authors:Ming Dai, Wenxuan Cheng, Jiedong Zhuang, Jiang-jiang Liu, Hongshen Zhao, Zhenhua Feng, Wankou Yang
Title: PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination
Abstract:
Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The codes and models are available at https://github.com/Dmmm1997/PropVG.
中文摘要:PropVG提出了一种端到端的基于提议的框架,通过对比学习和多粒度判别机制解决视觉定位中的现有缺陷,在多个基准测试中实现了卓越性能。
English Summary: PropVG introduces an end-to-end proposal-based framework with contrastive learning and multi-granularity discrimination to address limitations in visual grounding, achieving superior performance across multiple benchmarks.

Authors:Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang
Title: VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
Abstract:
Modern Large Language Model (LLM) serving systems increasingly support interactive applications, like real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment. This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, built from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained phase-specific control. It consists of a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. We implement VoltanaLLM in SGLang and evaluate its performance over multiple state-of-the-art LLMs and real-world datasets. The results demonstrate that VoltanaLLM achieves up to 36.3% energy savings while maintaining near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving. Code of VoltanaLLM is open-sourced on GitHub: https://github.com/Supercomputing-System-AI-Lab/VoltanaLLM.
Chinese: 本文提出VoltanaLLM系统,通过动态GPU频率调节和智能请求路由实现LLM服务能效优化,在保证服务等级目标的同时最高可节省36.3%的能耗。
English: This paper introduces VoltanaLLM, a system that optimizes energy efficiency in LLM serving through dynamic GPU frequency scaling and intelligent request routing, achieving up to 36.3% energy savings while maintaining service-level objectives.

Authors:Yujie Wang, Yunwei Zhao, Jing Yang, Han Han, Shiguang Shan, Jie Zhang
Title: Evaluating Cognitive-Behavioral Fixation via Multimodal User Viewing Patterns on Social Media
Abstract:
Digital social media platforms frequently contribute to cognitive-behavioral fixation, a phenomenon in which users exhibit sustained and repetitive engagement with narrow content domains. While cognitive-behavioral fixation has been extensively studied in psychology, methods for computationally detecting and evaluating such fixation remain underexplored. To address this gap, we propose a novel framework for assessing cognitive-behavioral fixation by analyzing users' multimodal social media engagement patterns. Specifically, we introduce a multimodal topic extraction module and a cognitive-behavioral fixation quantification module that collaboratively enable adaptive, hierarchical, and interpretable assessment of user behavior. Experiments on existing benchmarks and a newly curated multimodal dataset demonstrate the effectiveness of our approach, laying the groundwork for scalable computational analysis of cognitive fixation. All code in this project is publicly available for research purposes at https://github.com/Liskie/cognitive-fixation-evaluation.
中文: 本研究提出了一种新颖的计算框架,通过分析社交媒体多模态参与模式来检测认知行为固着,并在基准数据集上验证了有效性,相关代码已公开。
English: This study introduces a novel computational framework that detects cognitive-behavioral fixation through multimodal analysis of social media engagement, validated on benchmark datasets with publicly available code.

Authors:Svetlana Pavlitska, Beyza Keskin, Alwin Faßbender, Christian Hubschneider, J. Marius Zöllner
Title: Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation
Abstract:
Estimating accurate and well-calibrated predictive uncertainty is important for enhancing the reliability of computer vision models, especially in safety-critical applications like traffic scene perception. While ensemble methods are commonly used to quantify uncertainty by combining multiple models, a mixture of experts (MoE) offers an efficient alternative by leveraging a gating network to dynamically weight expert predictions based on the input. Building on the promising use of MoEs for semantic segmentation in our previous works, we show that well-calibrated predictive uncertainty estimates can be extracted from MoEs without architectural modifications. We investigate three methods to extract predictive uncertainty estimates: predictive entropy, mutual information, and expert variance. We evaluate these methods for an MoE with two experts trained on a semantical split of the A2D2 dataset. Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics under out-of-distribution (OOD) data. Additionally, we evaluate routing uncertainty computed via gate entropy and find that simple gating mechanisms lead to better calibration of routing uncertainty estimates than more complex classwise gates. Finally, our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
中文: 专家混合模型(MoE)能比集成方法更高效地为语义分割提供校准良好的预测不确定性估计,其中预测熵和互信息等方法在分布外数据下展现出更高的可靠性。
English: Mixtures of Experts (MoE) provide well-calibrated predictive uncertainty estimates for semantic segmentation more efficiently than ensembles, with methods like predictive entropy and mutual information showing improved reliability under out-of-distribution data.

Authors:Rochana R. Obadage, Lamia Salsabil, Sawood Alam, Bipasha Banarjee, William A. Ingram, Edward A. Fox, Jian Wu
Title: Toward Robust URL Extraction for Open Science: A Study of arXiv File Formats and Temporal Trends
Abstract:
In this work, we study how URL extraction results depend on input format. We compiled a pilot dataset by extracting URLs from 10 arXiv papers and used the same heuristic method to extract URLs from four formats derived from the PDF files or the source LaTeX files. We found that accurate and complete URL extraction from any single format or a combination of multiple formats is challenging, with the best F1-score of 0.71. Using the pilot dataset, we evaluate extraction performance across formats and show that structured formats like HTML and XML produce more accurate results than PDFs or Text. Combining multiple formats improves coverage, especially when targeting research-critical resources. We further apply URL extraction on two tasks, namely classifying URLs into open-access datasets and software and the others, and analyzing the trend of URLs usage in arXiv papers from 1992 to 2024. These results suggest that using a combination of multiple formats achieves better performance on URL extraction than a single format, and the number of URLs in arXiv papers has been steadily increasing since 1992 to 2014 and has been drastically increasing from 2014 to 2024. The dataset and the Jupyter notebooks used for the preliminary analysis are publicly available at https://github.com/lamps-lab/arxiv-urls
中文: 研究表明,结合多种文档格式可提高URL提取性能,最佳F1值达0.71,同时发现2014至2024年间arXiv论文中的URL使用量显著增长。
English: This study demonstrates that combining multiple document formats improves URL extraction performance, achieving the best F1-score of 0.71, and reveals a significant increase in URL usage in arXiv papers from 2014 to 2024.

Authors:Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita
Title: Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
Abstract:
This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods are stabilized on the assumption that groups are not changed in a video, our method detects dynamically changing groups by global optimization using a graph with all frames' groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git
本文提出了一种动态视频人群检测方法,通过增强的视觉语言模型提取局部与全局特征,并利用全局优化实现时序一致性,在公开数据集上超越了现有最优方法。
This paper introduces a dynamic human group detection method in videos that combines local and global features using an enhanced Vision-Language Model and achieves temporal consistency through global optimization, outperforming existing techniques on public datasets.

Authors:Mustafa Munir, Alex Zhang, Radu Marculescu
Title: VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation
Abstract:
Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet they do not capture fine-grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce \textit{VCMamba}, a novel vision backbone that integrates the strengths of CNNs and multi-directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. These convolutional blocks are then processed by later stages incorporating multi-directional Mamba blocks designed to efficiently model long-range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba's effectiveness through extensive experiments on ImageNet-1K classification and ADE20K semantic segmentation. Our VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN-B by 0.3% with 64% fewer parameters. Furthermore, VCMamba-B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer-L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at https://github.com/Wertyuui345/VCMamba.
中文: VCMamba是一种新型视觉骨干网络,融合了CNN的局部特征提取能力和多向Mamba SSM的全局上下文建模优势,在ImageNet分类和ADE20K分割任务中实现了线性复杂度的卓越性能。
English: VCMamba is a novel vision backbone that combines CNNs' local feature extraction with multi-directional Mamba SSMs' global context modeling, achieving superior performance with linear complexity on ImageNet classification and ADE20K segmentation.

Authors:Aisha Alansari, Hamzah Luqman
Title: AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs
Abstract:
Recently, extensive research on the hallucination of the large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic's widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. To assess the factual consistency and faithfulness of LLMs' outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and a comparative performance with reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval
中文摘要:本研究首次对阿拉伯语及多语言大语言模型在问答和摘要任务中的幻觉现象进行全面评估,发现阿拉伯语预训练模型在减少事实错误方面优于多语言模型,并与基于推理的模型表现相当。
English Summary: This study presents the first comprehensive evaluation of hallucination in Arabic and multilingual large language models across question answering and summarization tasks, revealing that Arabic pre-trained models outperform multilingual ones and match reasoning-based models in reducing factual errors.

Authors:Zhenyu Wu, Jiaoyan Chen, Norman W. Paton
Title: Schema Inference for Tabular Data Repositories Using Large Language Models
Abstract:
Minimally curated tabular data often contain representational inconsistencies across heterogeneous sources, and are accompanied by sparse metadata. Working with such data is intimidating. While prior work has advanced dataset discovery and exploration, schema inference remains difficult when metadata are limited. We present SI-LLM (Schema Inference using Large Language Models), which infers a concise conceptual schema for tabular data using only column headers and cell values. The inferred schema comprises hierarchical entity types, attributes, and inter-type relationships. In extensive evaluation on two datasets from web tables and open data, SI-LLM achieves promising end-to-end results, as well as better or comparable results to state-of-the-art methods at each step. All source code, full prompts, and datasets of SI-LLM are available at https://github.com/PierreWoL/SILLM.
中文:SI-LLM利用大型语言模型,仅通过列标题和单元格值即可从元数据稀缺的表格数据中推断出层次化概念模式,在网页表格和开放数据集的评估中展现出优于或可比肩现有先进方法的性能。
English: SI-LLM utilizes large language models to infer hierarchical conceptual schemas from tabular data with limited metadata, demonstrating competitive performance against state-of-the-art methods in evaluations on web tables and open datasets.

Authors:Zehua Pei, Hui-Ling Zhen, Ying Zhang, Zhiyuan Yang, Xing Li, Xianzhi Yu, Mingxuan Yuan, Bei Yu
Title: Behavioral Fingerprinting of Large Language Models
Abstract:
Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel ``Behavioral Fingerprinting'' framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model's intrinsic cognitive and interactive styles. Using a curated \textit{Diagnostic Prompt Suite} and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting
中文摘要:本文提出的"行为指纹"框架揭示了尽管大型语言模型的核心能力趋于一致,但其交互行为因不同的对齐策略而产生显著差异。
English Summary: This paper introduces a "Behavioral Fingerprinting" framework that reveals how LLMs' interactive behaviors diverge due to varying alignment strategies, despite converging core capabilities.

Authors:Moeen Nehzati
Title: Universal Representation of Generalized Convex Functions and their Gradients
Abstract:
Solutions to a wide range of optimization problems, from optimal transport theory to mathematical economics, often take the form of generalized convex functions (GCFs). This characterization can be used to convert nested bilevel optimization problems into single-level optimization problems. Despite this, the characterization has not been fully exploited in numerical optimization. When the solution to an optimization problem is known to belong to a particular class of objects, this information can be leveraged by parameterizing that class of objects and optimizing over this parameterization. The hallmark of a good parameterization is the Universal Approximation Property (UAP): that is, the parameterization approximates any object in the class arbitrarily well. For example, neural networks satisfy the UAP with respect to the class of continuous functions. Building on the literature concerned with the parameterization of convex functions, we extend these ideas to GCFs. We present a convex and potentially one-to-one parameterization of GCFs and their gradients that satisfies the UAP. We also compare this class to shallow neural networks and highlight their shared characteristics. The ideas pursued here have been implemented in the Python package \href{https://github.com/MoeenNehzati/gconvex}{\texttt{gconvex}}, available online. Using it, we tackle the problem of finding the revenue-maximizing auction for multiple goods and demonstrate how our parameterization can effectively solve this problem.
中文摘要:广义凸函数(GCFs)为将复杂的双层优化问题转化为单层形式提供了有效框架,新提出的参数化方法具备通用逼近性质,并通过开源Python软件包实现了实际应用。
English Summary: Generalized convex functions (GCFs) offer a powerful framework for transforming complex bilevel optimization problems into simpler single-level forms, with a new parameterization method enabling universal approximation and practical implementation through an open-source Python package.

Authors:Seojin Kim, Hyeontae Song, Jaehyun Nam, Jinwoo Shin
Title: Training Text-to-Molecule Models with Context-Aware Tokenization
Abstract:
Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug-discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of the substructure-level contexts in understanding molecule structures, e.g., ring systems, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 outperforms the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models to further boost the generation performance. Code is available at https://github.com/Songhyeontae/CAMT5.git.
中文: 提出的上下文感知分子T5(CAMT5)模型通过引入子结构级标记化和基于重要性的训练策略,能更好地捕捉分子全局结构,在文本到分子任务中以极少的训练标记实现了卓越性能。
English: The proposed Context-Aware Molecular T5 (CAMT5) model introduces substructure-level tokenization and an importance-based training strategy to better capture global molecular structures, achieving superior performance in text-to-molecule tasks with significantly reduced training tokens.

Authors:Yihan Chen, Jiawei Chen, Guozhao Mo, Xuanang Chen, Ben He, Xianpei Han, Le Sun
Title: CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
Abstract:
The growing integration of large language models (LLMs) into the peer review process presents potential risks to the fairness and reliability of scholarly evaluation. While LLMs offer valuable assistance for reviewers with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish between surface language refinement and substantial content generation, suggesting that they primarily rely on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement, while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews, covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector via a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications. Our code and data will be publicly available at https://github.com/Y1hanChen/COCONUTS.
大语言模型在同行评审中的应用可能损害公平性与可靠性,为此我们提出了基于内容的检测方法CoCoNUTS和CoCoDet,以实现更精准公正的评估。
The integration of LLMs in peer review risks fairness and reliability, prompting the development of CoCoNUTS and CoCoDet for content-based detection to ensure accurate and equitable evaluation.

Authors:Zhiqiu Xu, Amish Sethi, Mayur Naik, Ser-Nam Lim
Title: Delta Activations: A Representation for Finetuned Large Language Models
Abstract:
The success of powerful open source Large Language Models (LLMs) has enabled the community to create a vast collection of post-trained models adapted to specific tasks and domains. However, navigating and understanding these models remains challenging due to inconsistent metadata and unstructured repositories. We introduce Delta Activations, a method to represent finetuned models as vector embeddings by measuring shifts in their internal activations relative to a base model. This representation allows for effective clustering by domain and task, revealing structure in the model landscape. Delta Activations also demonstrate desirable properties: it is robust across finetuning settings and exhibits an additive property when finetuning datasets are mixed. In addition, we show that Delta Activations can embed tasks via few-shot finetuning, and further explore its use for model selection and merging. We hope Delta Activations can facilitate the practice of reusing publicly available models. Code is available at https://github.com/OscarXZQ/delta_activations.
中文: Delta Activations 是一种创新方法,通过测量微调后大语言模型相对于基础模型的内部激活变化,将其表示为向量嵌入,从而实现按领域和任务的有效聚类,并展现出鲁棒性和可加性。
English: Delta Activations is a novel method that represents fine-tuned large language models as vector embeddings by measuring their internal activation shifts relative to a base model, enabling effective clustering by domain and task while demonstrating robustness and additive properties.

Authors:Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, Lianhui Qin
Title: ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
Abstract:
While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning, making it a natural fit for concept memory. Our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, dynamically updating memory during test-time outperforms fixed settings, supporting the hypothesis that accumulating and abstracting patterns enables further solutions in a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.
Chinese: 本文提出了一种概念级记忆系统,能够从推理轨迹中提炼可重用的抽象概念,通过动态更新记忆实现测试时持续学习,在ARC-AGI基准测试中取得了7.5%的性能提升。
English: The paper introduces a concept-level memory system that distills reusable abstractions from reasoning traces, enabling test-time continual learning and achieving a 7.5% performance gain on the ARC-AGI benchmark through dynamic memory updates.

Authors:Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, Lianhui Qin
Title: ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
Abstract:
While inference-time scaling enables LLMs to carry out increasingly long and capable reasoning traces, the patterns and insights uncovered during these traces are immediately discarded once the context window is reset for a new query. External memory is a natural way to persist these discoveries, and recent work has shown clear benefits for reasoning-intensive tasks. We see an opportunity to make such memories more broadly reusable and scalable by moving beyond instance-based memory entries (e.g. exact query/response pairs, or summaries tightly coupled with the original problem context) toward concept-level memory: reusable, modular abstractions distilled from solution traces and stored in natural language. For future queries, relevant concepts are selectively retrieved and integrated into the prompt, enabling test-time continual learning without weight updates. Our design introduces new strategies for abstracting takeaways from rollouts and retrieving entries for new queries, promoting reuse and allowing memory to expand with additional experiences. We evaluate on ARC-AGI, a benchmark that stresses compositional generalization and abstract reasoning, making it a natural fit for concept memory. Our method yields a 7.5% relative gain over a strong no-memory baseline with performance continuing to scale with inference compute. We find abstract concepts to be the most consistent memory design, outscoring the baseline at all tested inference compute scales. Moreover, dynamically updating memory during test-time outperforms fixed settings, supporting the hypothesis that accumulating and abstracting patterns enables further solutions in a form of self-improvement. Code is available at https://github.com/matt-seb-ho/arc_memo.
Chinese: 本文提出了一种概念级记忆系统,能够从推理轨迹中提炼可重用的抽象概念,通过动态更新记忆实现测试时持续学习,在ARC-AGI基准测试中取得了7.5%的性能提升。
English: The paper introduces a concept-level memory system that distills reusable abstractions from reasoning traces, enabling test-time continual learning and achieving a 7.5% performance gain on the ARC-AGI benchmark through dynamic memory updates.

Authors:Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
Title: The Telephone Game: Evaluating Semantic Drift in Unified Models
Abstract:
Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T, as consistency between understanding and generation is critical for downstream use. Existing evaluations consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME, MMBench for I2T. These single-pass metrics do not reveal whether a model that understands a concept can also render it, nor whether meaning is preserved when cycling between image and text modalities. To address this, we introduce the Unified Consistency Framework for Unified Models (UCF-UM), a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. UCF formulates 3 metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic loss; (ii) Semantic Drift Rate (SDR), that summarizes semantic decay rate; and (iii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO, which is widely used in training; we create a new benchmark ND400, sampled from NoCaps and DOCCI and evaluate on seven recent models. UCF-UM reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantics over many alternations, whereas others like Vila-u drift quickly despite strong single-pass scores. Our results highlight cyclic consistency as a necessary complement to standard I2T and T2I evaluations, and provide practical metrics to consistently assess unified model's cross-modal stability and strength of their shared representations. Code: https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
Chinese Summary: 本文提出语义漂移协议(SDP),通过图文跨模态循环转换评估统一视觉语言模型的语义一致性,采用新度量指标和专用基准测试,揭示了不同模型在保持语义稳定性方面的显著差异。
English Summary: This paper introduces the Semantic Drift Protocol (SDP) to evaluate cross-modal consistency in unified visual language models by cycling between image-to-text and text-to-image tasks, revealing significant variations in semantic stability across models through novel metrics and a specialized benchmark.

Authors:Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
Title: The Telephone Game: Evaluating Semantic Drift in Unified Models
Abstract:
Employing a single, unified model (UM) for both visual understanding (image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a new direction in Visual Language Model (VLM) research. While UMs can also support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks consider these capabilities in isolation: FID and GenEval for T2I, and benchmarks such as MME, MMBench for I2T. These isolated single-pass metrics do not reveal cross-consistency: whether a model that "understands" a concept can also "render" it, nor whether semantic meaning is preserved when cycling between image and text modalities. To address this, we introduce the Semantic Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that alternates I2T and T2I over multiple generations to quantify semantic drift. We propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an object-level compliance score extending GenEval. To assess generalization beyond COCO dataset, which is widely used in training; we create a new benchmark Nocaps+Docci400, sampled from NoCaps and DOCCI and evaluated on seven recent models. SDP reveals substantial variation in cross-modal stability: some models like BAGEL maintain semantic meaning over many alternations, whereas others like VILA-U drift quickly despite strong single-pass scores. Our results highlight SDP as a necessary complement to standard I2T and T2I evaluations. Code is available at https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
Chinese Summary: 本文提出语义漂移协议(SDP),通过图文跨模态循环转换评估统一视觉语言模型的语义一致性,采用新度量指标和专用基准测试,揭示了不同模型在保持语义稳定性方面的显著差异。
English Summary: This paper introduces the Semantic Drift Protocol (SDP) to evaluate cross-modal consistency in unified visual language models by cycling between image-to-text and text-to-image tasks, revealing significant variations in semantic stability across models through novel metrics and a specialized benchmark.

Authors:Zanwei Zhou, Taoran Yi, Jiemin Fang, Chen Yang, Lingxi Xie, Xinggang Wang, Wei Shen, Qi Tian
Title: Few-step Flow for 3D Generation via Marginal-Data Transport Distillation
Abstract:
Flow-based 3D generation models typically require dozens of sampling steps during inference. Though few-step distillation methods, particularly Consistency Models (CMs), have achieved substantial advancements in accelerating 2D diffusion models, they remain under-explored for more complex 3D generation tasks. In this study, we propose a novel framework, MDT-dist, for few-step 3D flow distillation. Our approach is built upon a primary objective: distilling the pretrained model to learn the Marginal-Data Transport. Directly learning this objective needs to integrate the velocity fields, while this integral is intractable to be implemented. Therefore, we propose two optimizable objectives, Velocity Matching (VM) and Velocity Distillation (VD), to equivalently convert the optimization target from the transport level to the velocity and the distribution level respectively. Velocity Matching (VM) learns to stably match the velocity fields between the student and the teacher, but inevitably provides biased gradient estimates. Velocity Distillation (VD) further enhances the optimization process by leveraging the learned velocity fields to perform probability density distillation. When evaluated on the pioneer 3D generation framework TRELLIS, our method reduces sampling steps of each flow transformer from 25 to 1 or 2, achieving 0.68s (1 step x 2) and 0.94s (2 steps x 2) latency with 9.0x and 6.5x speedup on A800, while preserving high visual and geometric fidelity. Extensive experiments demonstrate that our method significantly outperforms existing CM distillation methods, and enables TRELLIS to achieve superior performance in few-step 3D generation.
中文: 本研究提出MDT-dist框架,通过速度匹配和速度蒸馏实现少步3D流蒸馏,将采样步骤从25步减少至1-2步,在A800上实现最高9倍加速且保持高质量生成效果,显著优于现有一致性模型蒸馏方法。
English: This study introduces MDT-dist, a novel framework for few-step 3D flow distillation that employs Velocity Matching and Velocity Distillation to efficiently reduce sampling steps from 25 to 1 or 2 while maintaining high fidelity, achieving up to 9x speedup on A800 and outperforming existing methods.

Authors:Kyra Wilson, Mattea Sim, Anna-Maria Gueorguieva, Aylin Caliskan
Title: No Thoughts Just AI: Biased LLM Hiring Recommendations Alter Human Decision Making and Limit Human Autonomy
Abstract:
In this study, we conduct a resume-screening experiment (N=528) where people collaborate with simulated AI models exhibiting race-based preferences (bias) to evaluate candidates for 16 high and low status occupations. Simulated AI bias approximates factual and counterfactual estimates of racial bias in real-world AI systems. We investigate people's preferences for White, Black, Hispanic, and Asian candidates (represented through names and affinity groups on quality-controlled resumes) across 1,526 scenarios and measure their unconscious associations between race and status using implicit association tests (IATs), which predict discriminatory hiring decisions but have not been investigated in human-AI collaboration. When making decisions without AI or with AI that exhibits no race-based preferences, people select all candidates at equal rates. However, when interacting with AI favoring a particular group, people also favor those candidates up to 90% of the time, indicating a significant behavioral shift. The likelihood of selecting candidates whose identities do not align with common race-status stereotypes can increase by 13% if people complete an IAT before conducting resume screening. Finally, even if people think AI recommendations are low quality or not important, their decisions are still vulnerable to AI bias under certain circumstances. This work has implications for people's autonomy in AI-HITL scenarios, AI and work, design and evaluation of AI hiring systems, and strategies for mitigating bias in collaborative decision-making tasks. In particular, organizational and regulatory policy should acknowledge the complex nature of AI-HITL decision making when implementing these systems, educating people who use them, and determining which are subject to oversight.
中文摘要:本研究表明,人类的招聘决策会受到人工智能种族偏见的显著影响,最高可达90%的模仿率,但通过内隐联想测试提升认知后,这种影响可降低13%。
English Summary: This study reveals that people's hiring decisions are significantly influenced by AI's racial biases, often mirroring them up to 90% of the time, though awareness through implicit association tests can reduce this effect by 13%.

Authors:Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, Lei Bai
Title: Transition Models: Rethinking the Generative Learning Objective
Abstract:
A fundamental dilemma in generative modeling persists: iterative diffusion models achieve outstanding fidelity, but at a significant computational cost, while efficient few-step alternatives are constrained by a hard quality ceiling. This conflict between generation steps and output quality arises from restrictive training objectives that focus exclusively on either infinitesimal dynamics (PF-ODEs) or direct endpoint prediction. We address this challenge by introducing an exact, continuous-time dynamics equation that analytically defines state transitions across any finite time interval. This leads to a novel generative paradigm, Transition Models (TiM), which adapt to arbitrary-step transitions, seamlessly traversing the generative trajectory from single leaps to fine-grained refinement with more steps. Despite having only 865M parameters, TiM achieves state-of-the-art performance, surpassing leading models such as SD3.5 (8B parameters) and FLUX.1 (12B parameters) across all evaluated step counts. Importantly, unlike previous few-step generators, TiM demonstrates monotonic quality improvement as the sampling budget increases. Additionally, when employing our native-resolution strategy, TiM delivers exceptional fidelity at resolutions up to 4096x4096.
中文摘要:过渡模型(TiM)通过引入连续时间动态方程,解决了生成模型中计算效率与输出质量之间的固有矛盾,实现了任意步长的灵活转换,以更少参数达到顶尖性能,并能在增加采样步数时保持质量的单调提升。
English Summary: Transition Models (TiM) overcome the trade-off between computational efficiency and output quality in generative modeling by introducing a continuous-time dynamics equation that enables flexible step transitions, achieving state-of-the-art performance with fewer parameters while maintaining monotonic quality improvement with increased sampling steps.

Authors:Congbo Ma, Yuxia Wang, Jia Wu, Jian Yang, Jing Du, Zitai Qiu, Qing Li, Hu Wang, Preslav Nakov
Title: Explicit and Implicit Data Augmentation for Social Event Detection
Abstract:
Social event detection involves identifying and categorizing important events from social media, which relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework, which combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes large language models to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score. The code is available at GitHub: https://github.com/congboma/SED-Aug.
中文: SED-Aug框架通过结合显式文本增强和隐式特征增强来改进社交媒体事件检测,在Twitter数据集上的F1分数显著提升了超过15%。
English: The SED-Aug framework enhances social event detection by combining explicit text and implicit feature augmentation, significantly improving F1 scores by over 15% on Twitter datasets.

Authors:Safouane El Ghazouali, Umberto Michelucci
Title: VisioFirm: Cross-Platform AI-assisted Annotation Tool for Computer Vision
Abstract:
AI models rely on annotated data to learn pattern and perform prediction. Annotation is usually a labor-intensive step that require associating labels ranging from a simple classification label to more complex tasks such as object detection, oriented bounding box estimation, and instance segmentation. Traditional tools often require extensive manual input, limiting scalability for large datasets. To address this, we introduce VisioFirm, an open-source web application designed to streamline image labeling through AI-assisted automation. VisioFirm integrates state-of-the-art foundation models into an interface with a filtering pipeline to reduce human-in-the-loop efforts. This hybrid approach employs CLIP combined with pre-trained detectors like Ultralytics models for common classes and zero-shot models such as Grounding DINO for custom labels, generating initial annotations with low-confidence thresholding to maximize recall. Through this framework, when tested on COCO-type of classes, initial prediction have been proven to be mostly correct though the users can refine these via interactive tools supporting bounding boxes, oriented bounding boxes, and polygons. Additionally, VisioFirm has on-the-fly segmentation powered by Segment Anything accelerated through WebGPU for browser-side efficiency. The tool supports multiple export formats (YOLO, COCO, Pascal VOC, CSV) and operates offline after model caching, enhancing accessibility. VisioFirm demonstrates up to 90\% reduction in manual effort through benchmarks on diverse datasets, while maintaining high annotation accuracy via clustering of connected CLIP-based disambiguate components and IoU-graph for redundant detection suppression. VisioFirm can be accessed from \href{https://github.com/OschAI/VisioFirm}{https://github.com/OschAI/VisioFirm}.
Chinese: VisioFirm 是一款开源网络应用程序,通过集成 CLIP 和 Grounding DINO 等基础模型实现AI辅助自动化图像标注,大幅减少人工操作,同时支持多种标注任务和导出格式。
English: VisioFirm is an open-source web application that leverages AI-assisted automation with foundation models like CLIP and Grounding DINO to streamline image labeling, significantly reducing manual effort while supporting various annotation tasks and export formats.

Authors:Orlando Castaneda, Kevin So-Tang, Kshitij Gurung
Title: Revisiting Simple Baselines for In-The-Wild Deepfake Detection
Abstract:
The widespread adoption of synthetic media demands accessible deepfake detectors and realistic benchmarks. While most existing research evaluates deepfake detectors on highly controlled datasets, we focus on the recently released "in-the-wild" benchmark, Deepfake-Eval-2024. Initial reporting on Deepfake-Eval-2024 showed that three finetuned open-source models achieve accuracies between 61% and 69%, significantly lagging behind the leading commercial deepfake detector with 82% accuracy. Our work revisits one of these baseline approaches, originally introduced by Ojha et al., which adapts standard pretrained vision backbones to produce generalizable deepfake detectors. We demonstrate that with better-tuned hyperparameters, this simple approach actually yields much higher performance -- 81% accuracy on Deepfake-Eval-2024 -- surpassing the previously reported accuracy of this baseline approach by 18% and competing with commercial deepfake detectors. We discuss tradeoffs in accuracy, computational costs, and interpretability, focusing on how practical these deepfake detectors might be when deployed in real-world settings. Our code can be found at https://github.com/Deepfake-Detection-KKO/deepfake-detection.
Chinese: 本研究证明,通过对基线深度伪造检测模型进行优化的超参数调整,在Deepfake-Eval-2024基准测试中达到了81%的准确率,可与商业检测器相媲美,同时揭示了实际部署中的权衡问题。
English: This study demonstrates that improved hyperparameter tuning of a baseline deepfake detection model achieves 81% accuracy on the Deepfake-Eval-2024 benchmark, rivaling commercial detectors while highlighting practical deployment tradeoffs.

Authors:Tarik Zaciragic, Aske Plaat, K. Joost Batenburg
Title: Analysis of Bluffing by DQN and CFR in Leduc Hold'em Poker
Abstract:
In the game of poker, being unpredictable, or bluffing, is an essential skill. When humans play poker, they bluff. However, most works on computer-poker focus on performance metrics such as win rates, while bluffing is overlooked. In this paper we study whether two popular algorithms, DQN (based on reinforcement learning) and CFR (based on game theory), exhibit bluffing behavior in Leduc Hold'em, a simplified version of poker. We designed an experiment where we let the DQN and CFR agent play against each other while we log their actions. We find that both DQN and CFR exhibit bluffing behavior, but they do so in different ways. Although both attempt to perform bluffs at different rates, the percentage of successful bluffs (where the opponent folds) is roughly the same. This suggests that bluffing is an essential aspect of the game, not of the algorithm. Future work should look at different bluffing styles and at the full game of poker. Code at https://github.com/TarikZ03/Bluffing-by-DQN-and-CFR-in-Leduc-Hold-em-Poker-Codebase.
中文摘要:本研究表明,在Leduc Hold'em扑克游戏中,DQN和CFR两种算法均表现出虚张声势行为,尽管虚张频率不同但成功率相近,说明虚张是游戏本质特征而非算法特性。
English Summary: This study demonstrates that both DQN and CFR algorithms exhibit bluffing behavior in Leduc Hold'em poker, with varying bluffing frequencies but similar success rates, indicating bluffing is inherent to the game rather than specific to algorithms.

Authors:Junqi Liao, Yaojun Wu, Chaoyi Lin, Zhipin Deng, Li Li, Dong Liu, Xiaoyan Sun
Title: EHVC: Efficient Hierarchical Reference and Quality Structure for Neural Video Coding
Abstract:
Neural video codecs (NVCs), leveraging the power of end-to-end learning, have demonstrated remarkable coding efficiency improvements over traditional video codecs. Recent research has begun to pay attention to the quality structures in NVCs, optimizing them by introducing explicit hierarchical designs. However, less attention has been paid to the reference structure design, which fundamentally should be aligned with the hierarchical quality structure. In addition, there is still significant room for further optimization of the hierarchical quality structure. To address these challenges in NVCs, we propose EHVC, an efficient hierarchical neural video codec featuring three key innovations: (1) a hierarchical multi-reference scheme that draws on traditional video codec design to align reference and quality structures, thereby addressing the reference-quality mismatch; (2) a lookahead strategy to utilize an encoder-side context from future frames to enhance the quality structure; (3) a layer-wise quality scale with random quality training strategy to stabilize quality structures during inference. With these improvements, EHVC achieves significantly superior performance to the state-of-the-art NVCs. Code will be released in: https://github.com/bytedance/NEVC.
中文摘要:提出的EHVC神经视频编解码器通过三项创新——层次化多参考对齐、前瞻上下文利用和稳定质量训练,优化了参考与质量结构,性能显著优于现有技术。
English Summary: The proposed EHVC neural video codec introduces three innovations—hierarchical multi-reference alignment, lookahead context utilization, and stabilized quality training—to significantly outperform existing codecs by optimizing reference and quality structures.

Authors:Quang-Huy Che, Duc-Khai Lam
Title: TriLiteNet: Lightweight Model for Multi-Task Visual Perception
Abstract:
Efficient perception models are essential for Advanced Driver Assistance Systems (ADAS), as these applications require rapid processing and response to ensure safety and effectiveness in real-world environments. To address the real-time execution needs of such perception models, this study introduces the TriLiteNet model. This model can simultaneously manage multiple tasks related to panoramic driving perception. TriLiteNet is designed to optimize performance while maintaining low computational costs. Experimental results on the BDD100k dataset demonstrate that the model achieves competitive performance across three key tasks: vehicle detection, drivable area segmentation, and lane line segmentation. Specifically, the TriLiteNet_{base} demonstrated a recall of 85.6% for vehicle detection, a mean Intersection over Union (mIoU) of 92.4% for drivable area segmentation, and an Acc of 82.3% for lane line segmentation with only 2.35M parameters and a computational cost of 7.72 GFLOPs. Our proposed model includes a tiny configuration with just 0.14M parameters, which provides a multi-task solution with minimal computational demand. Evaluated for latency and power consumption on embedded devices, TriLiteNet in both configurations shows low latency and reasonable power during inference. By balancing performance, computational efficiency, and scalability, TriLiteNet offers a practical and deployable solution for real-world autonomous driving applications. Code is available at https://github.com/chequanghuy/TriLiteNet.
中文摘要:本研究提出TriLiteNet模型,这是一种面向自动驾驶的高效多任务感知模型,在保持低计算成本和嵌入式设备低延迟的同时,能在车辆检测、可行驶区域分割和车道线分割三项关键任务中实现优越性能。
English Summary: This study introduces TriLiteNet, an efficient multi-task perception model for autonomous driving that achieves competitive performance in vehicle detection, drivable area segmentation, and lane line segmentation while maintaining low computational costs and latency on embedded devices.

Authors:Zeyu Gan, Hao Yi, Yong Liu
Title: CoT-Space: A Theoretical Framework for Internal Slow-Thinking via Reinforcement Learning
Abstract:
Reinforcement Learning (RL) has become a pivotal approach for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists, as traditional token-level RL frameworks fail to align with the reasoning-level nature of complex, multi-step thought processes like Chain-of-Thought (CoT). To address this challenge, we introduce CoT-Space, a novel theoretical framework that recasts LLM reasoning from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. This shift in perspective serves as a conceptual bridge, revitalizing foundational principles from classical learning theory to analyze the unique dynamics of LLMs. By analyzing this process from both a noise perspective and a risk perspective, we demonstrate that the convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. Furthermore, extensive experiments provide strong empirical validation for our theoretical findings. Our framework not only provides a coherent explanation for empirical phenomena such as overthinking but also offers a solid theoretical foundation to guide the future development of more effective and generalizable reasoning agents. We open-source our code at https://github.com/ZyGan1999/CoT-Space.
中文摘要:CoT-Space框架将大语言模型的推理重新定义为连续语义空间中的优化过程,通过连接经典学习理论解释了过度思考等现象,并为开发更优推理智能体奠定了理论基础。
English Summary: The CoT-Space framework redefines LLM reasoning as optimization in a continuous semantic space, bridging classical learning theory to explain phenomena like overthinking and providing a theoretical foundation for developing better reasoning agents.

Authors:Shiku Kaito, Shinnosuke Matsuo, Daiki Suehiro, Ryoma Bise
Title: Learning from Majority Label: A Novel Problem in Multi-class Multiple-Instance Learning
Abstract:
The paper proposes a novel multi-class Multiple-Instance Learning (MIL) problem called Learning from Majority Label (LML). In LML, the majority class of instances in a bag is assigned as the bag-level label. The goal of LML is to train a classification model that estimates the class of each instance using the majority label. This problem is valuable in a variety of applications, including pathology image segmentation, political voting prediction, customer sentiment analysis, and environmental monitoring. To solve LML, we propose a Counting Network trained to produce bag-level majority labels, estimated by counting the number of instances in each class. Furthermore, analysis experiments on the characteristics of LML revealed that bags with a high proportion of the majority class facilitate learning. Based on this result, we developed a Majority Proportion Enhancement Module (MPEM) that increases the proportion of the majority class by removing minority class instances within the bags. Experiments demonstrate the superiority of the proposed method on four datasets compared to conventional MIL methods. Moreover, ablation studies confirmed the effectiveness of each module. The code is available at \href{https://github.com/Shiku-Kaito/Learning-from-Majority-Label-A-Novel-Problem-in-Multi-class-Multiple-Instance-Learning}{here}.
Chinese: 本文提出了一种名为“从多数标签学习”的新型多类多示例学习问题,其中包标签由实例的多数类决定,并通过带有多数比例增强模块的计数网络在四个数据集上超越了传统方法。
English: This paper introduces a novel multi-class multiple-instance learning problem called Learning from Majority Label (LML), where bag labels are determined by the majority class of instances, and proposes a Counting Network with a Majority Proportion Enhancement Module that outperforms conventional methods across four datasets.

Authors:Or Shachar, Uri Katz, Yoav Goldberg, Oren Glickman
Title: NER Retriever: Zero-Shot Named Entity Retrieval with Type-Aware Embeddings
Abstract:
We present NER Retriever, a zero-shot retrieval framework for ad-hoc Named Entity Retrieval, a variant of Named Entity Recognition (NER), where the types of interest are not provided in advance, and a user-defined type description is used to retrieve documents mentioning entities of that type. Instead of relying on fixed schemas or fine-tuned models, our method builds on internal representations of large language models (LLMs) to embed both entity mentions and user-provided open-ended type descriptions into a shared semantic space. We show that internal representations, specifically the value vectors from mid-layer transformer blocks, encode fine-grained type information more effectively than commonly used top-layer embeddings. To refine these representations, we train a lightweight contrastive projection network that aligns type-compatible entities while separating unrelated types. The resulting entity embeddings are compact, type-aware, and well-suited for nearest-neighbor search. Evaluated on three benchmarks, NER Retriever significantly outperforms both lexical and dense sentence-level retrieval baselines. Our findings provide empirical support for representation selection within LLMs and demonstrate a practical solution for scalable, schema-free entity retrieval. The NER Retriever Codebase is publicly available at https://github.com/ShacharOr100/ner_retriever
中文: NER检索器是一种零样本检索框架,利用大型语言模型的内部表示将实体和用户定义的类型描述嵌入共享语义空间,无需预定义模式即可在实体检索基准上实现卓越性能。
English: NER Retriever is a zero-shot framework that leverages large language models' internal representations to embed entities and user-defined type descriptions into a shared semantic space, achieving superior performance on entity retrieval benchmarks without predefined schemas.

Authors:Zhaoyan Gong, Juan Li, Zhiqiang Liu, Lei Liang, Huajun Chen, Wen Zhang
Title: RTQA : Recursive Thinking for Complex Temporal Knowledge Graph Question Answering with Large Language Models
Abstract:
Current temporal knowledge graph question answering (TKGQA) methods primarily focus on implicit temporal constraints, lacking the capability of handling more complex temporal queries, and struggle with limited reasoning abilities and error propagation in decomposition frameworks. We propose RTQA, a novel framework to address these challenges by enhancing reasoning over TKGs without requiring training. Following recursive thinking, RTQA recursively decomposes questions into sub-problems, solves them bottom-up using LLMs and TKG knowledge, and employs multi-path answer aggregation to improve fault tolerance. RTQA consists of three core components: the Temporal Question Decomposer, the Recursive Solver, and the Answer Aggregator. Experiments on MultiTQ and TimelineKGQA benchmarks demonstrate significant Hits@1 improvements in "Multiple" and "Complex" categories, outperforming state-of-the-art methods. Our code and data are available at https://github.com/zjukg/RTQA.
中文: RTQA是一种新颖的框架,通过递归分解复杂查询为子问题,利用大语言模型和时序知识图谱知识自底向上求解,并采用多路径答案聚合提升容错性,在基准测试中实现了最先进的性能表现。
English: RTQA is a novel framework that enhances reasoning over temporal knowledge graphs by recursively decomposing complex queries into sub-problems, solving them with LLMs and TKG knowledge, and aggregating answers for improved fault tolerance, achieving state-of-the-art performance on benchmarks.

Authors:Yijun Zhou, Yikui Zhai, Zilu Ying, Tingfeng Xian, Wenlve Zhou, Zhiheng Zhou, Xiaolin Tian, Xudong Jia, Hongsheng Zhang, C. L. Philip Chen
Title: Multimodal Feature Fusion Network with Text Difference Enhancement for Remote Sensing Change Detection
Abstract:
Although deep learning has advanced remote sensing change detection (RSCD), most methods rely solely on image modality, limiting feature representation, change pattern modeling, and generalization especially under illumination and noise disturbances. To address this, we propose MMChange, a multimodal RSCD method that combines image and text modalities to enhance accuracy and robustness. An Image Feature Refinement (IFR) module is introduced to highlight key regions and suppress environmental noise. To overcome the semantic limitations of image features, we employ a vision language model (VLM) to generate semantic descriptions of bitemporal images. A Textual Difference Enhancement (TDE) module then captures fine grained semantic shifts, guiding the model toward meaningful changes. To bridge the heterogeneity between modalities, we design an Image Text Feature Fusion (ITFF) module that enables deep cross modal integration. Extensive experiments on LEVIRCD, WHUCD, and SYSUCD demonstrate that MMChange consistently surpasses state of the art methods across multiple metrics, validating its effectiveness for multimodal RSCD. Code is available at: https://github.com/yikuizhai/MMChange.
中文: 提出的MMChange方法通过融合图像与文本模态,借助专门模块提升遥感变化检测的精度和鲁棒性,在多项实验中均优于现有最优方法。
English: The proposed MMChange method enhances remote sensing change detection by integrating image and text modalities through specialized modules to improve accuracy and robustness, outperforming state-of-the-art approaches in experiments.

Authors:Ruiling Guo, Xinwei Yang, Chen Huang, Tong Zhang, Yong Hu
Title: CANDY: Benchmarking LLMs' Limitations and Assistive Potential in Chinese Misinformation Fact-Checking
Abstract:
The effectiveness of large language models (LLMs) to fact-check misinformation remains uncertain, despite their growing use. To this end, we present CANDY, a benchmark designed to systematically evaluate the capabilities and limitations of LLMs in fact-checking Chinese misinformation. Specifically, we curate a carefully annotated dataset of ~20k instances. Our analysis shows that current LLMs exhibit limitations in generating accurate fact-checking conclusions, even when enhanced with chain-of-thought reasoning and few-shot prompting. To understand these limitations, we develop a taxonomy to categorize flawed LLM-generated explanations for their conclusions and identify factual fabrication as the most common failure mode. Although LLMs alone are unreliable for fact-checking, our findings indicate their considerable potential to augment human performance when deployed as assistive tools in scenarios. Our dataset and code can be accessed at https://github.com/SCUNLP/CANDY
中文摘要:CANDY基准测试表明,尽管大型语言模型在中文不实信息核查中存在事实捏造等局限,但作为辅助工具仍具备提升人类核查能力的潜力。
English Summary: The CANDY benchmark reveals that large language models currently struggle with accurate Chinese misinformation fact-checking due to frequent factual fabrication, yet they show promise as assistive tools for human fact-checkers.

Authors:Minghui Zhang, Yaoyu Liu, Junyang Wu, Xin You, Hanxiao Zhang, Junjun He, Yun Gu
Title: TopoSculpt: Betti-Steered Topological Sculpting of 3D Fine-grained Tubular Shapes
Abstract:
Medical tubular anatomical structures are inherently three-dimensional conduits with lumens, enclosing walls, and complex branching topologies. Accurate reconstruction of their geometry and topology is crucial for applications such as bronchoscopic navigation and cerebral arterial connectivity assessment. Existing methods often rely on voxel-wise overlap measures, which fail to capture topological correctness and completeness. Although topology-aware losses and persistent homology constraints have shown promise, they are usually applied patch-wise and cannot guarantee global preservation or correct geometric errors at inference. To address these limitations, we propose a novel TopoSculpt, a framework for topological refinement of 3D fine-grained tubular structures. TopoSculpt (i) adopts a holistic whole-region modeling strategy to capture full spatial context, (ii) first introduces a Topological Integrity Betti (TIB) constraint that jointly enforces Betti number priors and global integrity, and (iii) employs a curriculum refinement scheme with persistent homology to progressively correct errors from coarse to fine scales. Extensive experiments on challenging pulmonary airway and Circle of Willis datasets demonstrate substantial improvements in both geometry and topology. For instance, $β_{0}$ errors are reduced from 69.00 to 3.40 on the airway dataset and from 1.65 to 0.30 on the CoW dataset, with Tree length detected and branch detected rates improving by nearly 10\%. These results highlight the effectiveness of TopoSculpt in correcting critical topological errors and advancing the high-fidelity modeling of complex 3D tubular anatomy. The project homepage is available at: https://github.com/Puzzled-Hui/TopoSculpt.
中文: TopoSculpt框架通过整体区域建模、拓扑完整性约束和渐进式优化策略,显著提升了三维管状解剖结构的几何精度与拓扑正确性,在气道和脑动脉数据上取得突破性改进。
English: The proposed TopoSculpt framework introduces a holistic topological refinement approach for 3D tubular structures, employing Betti number constraints and persistent homology to significantly improve geometric accuracy and topological integrity across medical datasets.

Authors:Yuqing Huang, Rongyang Zhang, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Xuyang Zhi, Guiquan Liu, Xin Li, Hao Wang, Enhong Chen
Title: SelfAug: Mitigating Catastrophic Forgetting in Retrieval-Augmented Generation via Distribution Self-Alignment
Abstract:
Recent advancements in large language models (LLMs) have revolutionized natural language processing through their remarkable capabilities in understanding and executing diverse tasks. While supervised fine-tuning, particularly in Retrieval-Augmented Generation (RAG) scenarios, effectively enhances task-specific performance, it often leads to catastrophic forgetting, where models lose their previously acquired knowledge and general capabilities. Existing solutions either require access to general instruction data or face limitations in preserving the model's original distribution. To overcome these limitations, we propose SelfAug, a self-distribution alignment method that aligns input sequence logits to preserve the model's semantic distribution, thereby mitigating catastrophic forgetting and improving downstream performance. Extensive experiments demonstrate that SelfAug achieves a superior balance between downstream learning and general capability retention. Our comprehensive empirical analysis reveals a direct correlation between distribution shifts and the severity of catastrophic forgetting in RAG scenarios, highlighting how the absence of RAG capabilities in general instruction tuning leads to significant distribution shifts during fine-tuning. Our findings not only advance the understanding of catastrophic forgetting in RAG contexts but also provide a practical solution applicable across diverse fine-tuning scenarios. Our code is publicly available at https://github.com/USTC-StarTeam/SelfAug.
中文: 提出的SelfAug方法通过对齐输入序列对数来保持模型语义分布,有效缓解了微调大语言模型中的灾难性遗忘问题,在下游任务性能和通用能力保留之间实现了更优的平衡。
English: The proposed SelfAug method effectively mitigates catastrophic forgetting in fine-tuned LLMs by aligning input sequence logits to preserve the model's semantic distribution, achieving superior balance between downstream task performance and general capability retention.

Authors:Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
Title: MTQA:Matrix of Thought for Enhanced Reasoning in Complex Question Answering
Abstract:
Complex Question Answering (QA) is a fundamental and challenging task in NLP. While large language models (LLMs) exhibit impressive performance in QA, they suffer from significant performance degradation when facing complex and abstract QA tasks due to insufficient reasoning capabilities. Works such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) aim to enhance LLMs' reasoning abilities, but they face issues such as in-layer redundancy in tree structures and single paths in chain structures. Although some studies utilize Retrieval-Augmented Generation (RAG) methods to assist LLMs in reasoning, the challenge of effectively utilizing large amounts of information involving multiple entities and hops remains critical. To address this, we propose the Matrix of Thought (MoT), a novel and efficient LLM thought structure. MoT explores the problem in both horizontal and vertical dimensions through the "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep-level thinking, reducing redundancy within the column cells and enhancing reasoning capabilities. Furthermore, we develop a fact-correction mechanism by constructing knowledge units from retrieved knowledge graph triples and raw text to enhance the initial knowledge for LLM reasoning and correct erroneous answers. This leads to the development of an efficient and accurate QA framework (MTQA). Experimental results show that our framework outperforms state-of-the-art methods on four widely-used datasets in terms of F1 and EM scores, with reasoning time only 14.4\% of the baseline methods, demonstrating both its efficiency and accuracy. The code for this framework is available at https://github.com/lyfiter/mtqa.
中文摘要:针对大型语言模型在复杂任务中的推理缺陷,本文提出的思维矩阵(MoT)通过纵横维度的"列-单元通信"机制实现多策略深度思考,结合事实校正机制有效提升推理能力与效率,实验证明其性能显著优于现有方法。
English Summary: The Matrix of Thought (MoT) is introduced as an efficient reasoning structure for Large Language Models, addressing limitations in existing methods by enabling multi-dimensional thinking and reducing redundancy while incorporating fact-correction mechanisms to enhance accuracy and speed.

Authors:Fengxiao Tang, Yufeng Li, Zongzong Wu, Ming Zhao
Title: Chain or tree? Re-evaluating complex reasoning from the perspective of a matrix of thought
Abstract:
Large Language Models (LLMs) face significant accuracy degradation due to insufficient reasoning ability when dealing with complex and abstract tasks. Thought structures such as Chain of Thought (CoT) and Tree of Thought (ToT) focus on enhancing the reasoning capability of LLMs. However, they suffer from inherent drawbacks such as redundancy within the same layer of the tree structure and the singularity of the paths in the chain structure. Some studies have utilized Retrieval-Augmented Generation (RAG) methods to enhance CoT and ToT in mitigating hallucinations in LLMs, yet the fundamental shortcomings of the thought structures still persist. Furthermore, when dealing with multi-entity and multi-hop information, the retrieved verification knowledge often contains large amounts of fragmented, superficial, or even erroneous data, misleading the reasoning process of LLMs. To address these issues, we propose the Matrix of Thought (MoT), a novel and efficient thought structure for LLMs. MoT explores problems in both horizontal and vertical dimensions through a "column-cell communication" mechanism, enabling LLMs to actively engage in multi-strategy and deep thinking while reducing redundancy in the thought nodes within the column cells, thereby enhancing the reasoning capability of LLMs. Additionally, through a fact-correction mechanism, it leverages the knowledge graph triples retrieved by RAG and the original text to construct knowledge units and correct erroneous answers. To validate the effectiveness of this method, we conducted extensive experiments in three tasks: 24-point game, question answering evaluation, and proposition writing.The results demonstrate that our framework outperforms state-of-the-art methods, with reasoning time only 14.4\% of that of the baseline method, proving its efficiency and accuracy. The code for framework is available at https://github.com/lyfiter/mtqa.
中文摘要:针对大型语言模型在复杂任务中的推理缺陷,本文提出的思维矩阵(MoT)通过纵横维度的"列-单元通信"机制实现多策略深度思考,结合事实校正机制有效提升推理能力与效率,实验证明其性能显著优于现有方法。
English Summary: The Matrix of Thought (MoT) is introduced as an efficient reasoning structure for Large Language Models, addressing limitations in existing methods by enabling multi-dimensional thinking and reducing redundancy while incorporating fact-correction mechanisms to enhance accuracy and speed.

Authors:Xiaofu Chen, Israfel Salazar, Yova Kementchedjhieva
Title: SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation
Abstract:
As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs, while today, despite advances in hardware, they remain unpopular due to low correlation to human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation to human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development.Our code can be found at https://github.com/mbzuai-nlp/SPECS.
中文:SPECS是一种针对长图像描述的新型高效评估指标,在保持与人类判断高度相关的同时,显著提升了计算效率,可作为模型开发中的实用工具。
English: SPECS is a new efficient metric for evaluating long image captions that matches the performance of LLM-based metrics in human judgment correlation while being significantly faster.

Authors:Gowen Loo, Chang Liu, Qinghong Yin, Xiang Chen, Jiawei Chen, Jingyuan Zhang, Yu Tian
Title: MobileRAG: Enhancing Mobile Agent with Retrieval-Augmented Generation
Abstract:
Smartphones have become indispensable in people's daily lives, permeating nearly every aspect of modern society. With the continuous advancement of large language models (LLMs), numerous LLM-based mobile agents have emerged. These agents are capable of accurately parsing diverse user queries and automatically assisting users in completing complex or repetitive operations. However, current agents 1) heavily rely on the comprehension ability of LLMs, which can lead to errors caused by misoperations or omitted steps during tasks, 2) lack interaction with the external environment, often terminating tasks when an app cannot fulfill user queries, and 3) lack memory capabilities, requiring each instruction to reconstruct the interface and being unable to learn from and correct previous mistakes. To alleviate the above issues, we propose MobileRAG, a mobile agents framework enhanced by Retrieval-Augmented Generation (RAG), which includes InterRAG, LocalRAG, and MemRAG. It leverages RAG to more quickly and accurately identify user queries and accomplish complex and long-sequence mobile tasks. Additionally, to more comprehensively assess the performance of MobileRAG, we introduce MobileRAG-Eval, a more challenging benchmark characterized by numerous complex, real-world mobile tasks that require external knowledge assistance. Extensive experimental results on MobileRAG-Eval demonstrate that MobileRAG can easily handle real-world mobile tasks, achieving 10.3\% improvement over state-of-the-art methods with fewer operational steps. Our code is publicly available at: https://github.com/liuxiaojieOutOfWorld/MobileRAG_arxiv
中文:MobileRAG是一种基于检索增强生成技术的新型移动代理框架,通过提升查询准确性、实现环境交互和集成记忆功能,能够以更少操作步骤高效处理复杂移动任务,有效解决了现有系统依赖性强、交互不足和缺乏记忆的问题。
English: MobileRAG is a novel mobile agent framework enhanced by Retrieval-Augmented Generation that addresses current limitations by improving query accuracy, enabling environmental interaction, and incorporating memory capabilities to handle complex mobile tasks more efficiently with fewer errors.

Authors:Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen
Title: False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize
Abstract:
Large Language Models (LLMs) can comply with harmful instructions, raising serious safety concerns despite their impressive capabilities. Recent work has leveraged probing-based approaches to study the separability of malicious and benign inputs in LLMs' internal representations, and researchers have proposed using such probing methods for safety detection. We systematically re-examine this paradigm. Motivated by poor out-of-distribution performance, we hypothesize that probes learn superficial patterns rather than semantic harmfulness. Through controlled experiments, we confirm this hypothesis and identify the specific patterns learned: instructional patterns and trigger words. Our investigation follows a systematic approach, progressing from demonstrating comparable performance of simple n-gram methods, to controlled experiments with semantically cleaned datasets, to detailed analysis of pattern dependencies. These results reveal a false sense of security around current probing-based approaches and highlight the need to redesign both models and evaluation protocols, for which we provide further discussions in the hope of suggesting responsible further research in this direction. We have open-sourced the project at https://github.com/WangCheng0116/Why-Probe-Fails.
中文: 本研究系统审视了基于探测的大语言模型安全检测方法,发现其依赖表层模式而非语义理解,揭示了当前安全评估方法的局限性。
English: This study critically examines probing-based safety detection methods in Large Language Models, revealing that they rely on superficial patterns rather than semantic understanding, thereby exposing limitations in current safety evaluation approaches.

Authors:Haiwei Xue, Xiangyang Luo, Zhanghao Hu, Xin Zhang, Xunzhi Xiang, Yuqin Dai, Jianzhuang Liu, Zhensong Zhang, Minglei Li, Jian Yang, Fei Ma, Zhiyong Wu, Changpeng Yang, Zonghong Dai, Fei Richard Yu
Title: Human Motion Video Generation: A Survey
Abstract:
Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads or dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods, lacking a comprehensive overview of the entire generative process. This paper addresses this gap by providing an in-depth survey of human motion video generation, encompassing over ten sub-tasks, and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey that discusses the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal for this survey is to unveil the prospects of human motion video generation and serve as a valuable resource for advancing the comprehensive applications of digital humans. A complete list of the models examined in this survey is available in Our Repository https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.
中文: 本综述首次全面概述人体运动视频生成领域,涵盖十余项子任务并详述生成过程的五个阶段,特别探讨了大语言模型的潜力,系统回顾了视觉、文本和音频三大模态的技术进展。
English: This survey provides the first comprehensive overview of human motion video generation, covering over ten sub-tasks and detailing the five-phase generation process while highlighting the potential of large language models and reviewing developments across vision, text, and audio modalities.

Authors:Jiajun Song, Xiaoou Liu
Title: SalientFusion: Context-Aware Compositional Zero-Shot Food Recognition
Abstract:
Food recognition has gained significant attention, but the rapid emergence of new dishes requires methods for recognizing unseen food categories, motivating Zero-Shot Food Learning (ZSFL). We propose the task of Compositional Zero-Shot Food Recognition (CZSFR), where cuisines and ingredients naturally align with attributes and objects in Compositional Zero-Shot learning (CZSL). However, CZSFR faces three challenges: (1) Redundant background information distracts models from learning meaningful food features, (2) Role confusion between staple and side dishes leads to misclassification, and (3) Semantic bias in a single attribute can lead to confusion of understanding. Therefore, we propose SalientFusion, a context-aware CZSFR method with two components: SalientFormer, which removes background redundancy and uses depth features to resolve role confusion; DebiasAT, which reduces the semantic bias by aligning prompts with visual features. Using our proposed benchmarks, CZSFood-90 and CZSFood-164, we show that SalientFusion achieves state-of-the-art results on these benchmarks and the most popular general datasets for the general CZSL. The code is avaliable at https://github.com/Jiajun-RUC/SalientFusion.
中文: 本研究提出组合零样本食物识别任务以解决背景干扰和语义偏差等挑战,通过SalientFusion方法在新基准和通用数据集上取得了最优性能。
English: The study introduces Compositional Zero-Shot Food Recognition (CZSFR) to address challenges like background distraction and semantic bias, proposing the SalientFusion method which achieves state-of-the-art results on new benchmarks and general datasets.

Authors:Nan Yang, Yang Wang, Zhanwen Liu, Yuchao Dai, Yang Liu, Xiangmo Zhao
Title: Focus Through Motion: RGB-Event Collaborative Token Sparsification for Efficient Object Detection
Abstract:
Existing RGB-Event detection methods process the low-information regions of both modalities (background in images and non-event regions in event data) uniformly during feature extraction and fusion, resulting in high computational costs and suboptimal performance. To mitigate the computational redundancy during feature extraction, researchers have respectively proposed token sparsification methods for the image and event modalities. However, these methods employ a fixed number or threshold for token selection, hindering the retention of informative tokens for samples with varying complexity. To achieve a better balance between accuracy and efficiency, we propose FocusMamba, which performs adaptive collaborative sparsification of multimodal features and efficiently integrates complementary information. Specifically, an Event-Guided Multimodal Sparsification (EGMS) strategy is designed to identify and adaptively discard low-information regions within each modality by leveraging scene content changes perceived by the event camera. Based on the sparsification results, a Cross-Modality Focus Fusion (CMFF) module is proposed to effectively capture and integrate complementary features from both modalities. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that the proposed method achieves superior performance in both accuracy and efficiency compared to existing methods. The code will be available at https://github.com/Zizzzzzzz/FocusMamba.
中文: FocusMamba提出了一种自适应协同稀疏化方法,通过事件引导策略智能剔除RGB-事件数据中的低信息区域并融合互补特征,在基准数据集上实现了精度与效率的双重提升。
English: FocusMamba introduces an adaptive collaborative sparsification method that efficiently discards low-information regions in RGB-Event data and integrates complementary features, achieving superior accuracy and efficiency on benchmark datasets.

Authors:Yanbo Wang, Yongcan Yu, Jian Liang, Ran He
Title: A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models
Abstract:
The development of Long-CoT reasoning has advanced LLM performance across various tasks, including language understanding, complex problem solving, and code generation. This paradigm enables models to generate intermediate reasoning steps, thereby improving both accuracy and interpretability. However, despite these advancements, a comprehensive understanding of how CoT-based reasoning affects the trustworthiness of language models remains underdeveloped. In this paper, we survey recent work on reasoning models and CoT techniques, focusing on five core dimensions of trustworthy reasoning: truthfulness, safety, robustness, fairness, and privacy. For each aspect, we provide a clear and structured overview of recent studies in chronological order, along with detailed analyses of their methodologies, findings, and limitations. Future research directions are also appended at the end for reference and discussion. Overall, while reasoning techniques hold promise for enhancing model trustworthiness through hallucination mitigation, harmful content detection, and robustness improvement, cutting-edge reasoning models themselves often suffer from comparable or even greater vulnerabilities in safety, robustness, and privacy. By synthesizing these insights, we hope this work serves as a valuable and timely resource for the AI safety community to stay informed on the latest progress in reasoning trustworthiness. A full list of related papers can be found at \href{https://github.com/ybwang119/Awesome-reasoning-safety}{https://github.com/ybwang119/Awesome-reasoning-safety}.
中文: 本文综述了思维链推理对语言模型可信度在真实性、安全性、鲁棒性、公平性和隐私性五个维度的影响,发现其虽提升准确性和可解释性,但也带来脆弱性,为AI安全研究提供了全面参考。
English: This paper surveys how Chain-of-Thought reasoning impacts language model trustworthiness across five dimensions—truthfulness, safety, robustness, fairness, and privacy—finding that while it enhances accuracy and interpretability, it also introduces vulnerabilities, providing a comprehensive resource for AI safety research.

Authors:Huhong Xian, Rui Liu, Berrak Sisman, Haizhou Li
Title: NE-PADD: Leveraging Named Entity Knowledge for Robust Partial Audio Deepfake Detection via Attention Aggregation
Abstract:
Different from traditional sentence-level audio deepfake detection (ADD), partial audio deepfake detection (PADD) requires frame-level positioning of the location of fake speech. While some progress has been made in this area, leveraging semantic information from audio, especially named entities, remains an underexplored aspect. To this end, we propose NE-PADD, a novel method for Partial Audio Deepfake Detection (PADD) that leverages named entity knowledge through two parallel branches: Speech Name Entity Recognition (SpeechNER) and PADD. The approach incorporates two attention aggregation mechanisms: Attention Fusion (AF) for combining attention weights and Attention Transfer (AT) for guiding PADD with named entity semantics using an auxiliary loss. Built on the PartialSpoof-NER dataset, experiments show our method outperforms existing baselines, proving the effectiveness of integrating named entity knowledge in PADD. The code is available at https://github.com/AI-S2-Lab/NE-PADD.
中文: 本文提出NE-PADD方法,通过并行语音命名实体识别和检测分支结合注意力机制,利用命名实体知识实现局部音频深度伪造检测,实验证明其性能优于现有基线模型。
English: This paper introduces NE-PADD, a novel method for partial audio deepfake detection that leverages named entity knowledge through parallel speech recognition and detection branches with attention mechanisms, demonstrating superior performance over existing baselines.

Authors:Xiannan Huang, Shuhan Qiu, Jiayuan Du, Chao Yang
Title: Online time series prediction using feature adjustment
Abstract:
Time series forecasting is of significant importance across various domains. However, it faces significant challenges due to distribution shift. This issue becomes particularly pronounced in online deployment scenarios where data arrives sequentially, requiring models to adapt continually to evolving patterns. Current time series online learning methods focus on two main aspects: selecting suitable parameters to update (e.g., final layer weights or adapter modules) and devising suitable update strategies (e.g., using recent batches, replay buffers, or averaged gradients). We challenge the conventional parameter selection approach, proposing that distribution shifts stem from changes in underlying latent factors influencing the data. Consequently, updating the feature representations of these latent factors may be more effective. To address the critical problem of delayed feedback in multi-step forecasting (where true values arrive much later than predictions), we introduce ADAPT-Z (Automatic Delta Adjustment via Persistent Tracking in Z-space). ADAPT-Z utilizes an adapter module that leverages current feature representations combined with historical gradient information to enable robust parameter updates despite the delay. Extensive experiments demonstrate that our method consistently outperforms standard base models without adaptation and surpasses state-of-the-art online learning approaches across multiple datasets. The code is available at https://github.com/xiannanhuang/ADAPT-Z.
中文:时间序列预测面临分布漂移的挑战,ADAPT-Z方法通过更新潜在特征表示并利用历史梯度进行鲁棒参数更新,有效应对延迟反馈问题,在多个数据集上超越了现有方法。
English: Time series forecasting faces challenges from distribution shifts, which ADAPT-Z addresses by updating latent feature representations and using historical gradients for robust parameter updates despite delayed feedback, outperforming existing methods.

Authors:Junhui Li, Chengbin Feng, Zhiwei Yang, Qi Mo, Wei Wang
Title: BIDO: A Unified Approach to Address Obfuscation and Concept Drift Challenges in Image-based Malware Detection
Abstract:
To identify malicious Android applications, various malware detection techniques have been proposed. Among them, image-based approaches are considered potential alternatives due to their efficiency and scalability. Recent studies have reported that these approaches suffer significant performance declines when confronted with obfuscation or concept drift. However, existing solutions often treat these two challenges as different problems, offering independent solutions. These techniques overlook the fact that both challenges share a common statistical root, out-of-distribution, and research from this perspective remains limited. In response, we propose BIDO, a hybrid image-based malware detector designed to enhance robustness against both obfuscation and concept drift simultaneously. Specifically, to improve the discriminative power of image features, we introduce a local feature selection module that identifies informative subregions within malware images. Second, to enhance feature robustness, we model pairwise cross-modal dependencies in an outer product space, enabling the extraction of stable co-occurrence patterns. Third, to ensure feature compactness, we design a learnable metric that pulls samples with identical labels closer while pushing apart those with different labels, regardless of obfuscation or concept drift. Extensive experiments on the real-world datasets demonstrate that BIDO significantly outperforms existing baselines, achieving higher robustness against both concept drift and obfuscation. The source code is available at: https://github.com/whatishope/BIDO/.
Chinese: 提出的BIDO框架通过局部特征选择、跨模态依赖建模和可学习度量,同时应对混淆和概念漂移问题,实验证明其在安卓恶意软件检测中具有更强的鲁棒性。
English: The proposed BIDO framework enhances Android malware detection by addressing obfuscation and concept drift through local feature selection, cross-modal dependency modeling, and a learnable metric, demonstrating superior robustness in experiments.

Authors:Shakiba Amirshahi, Amin Bigdeli, Charles L. A. Clarke, Amira Ghenai
Title: Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain
Abstract:
Retrieval augmented generation (RAG) systems provide a method for factually grounding the responses of a Large Language Model (LLM) by providing retrieved evidence, or context, as support. Guided by this context, RAG systems can reduce hallucinations and expand the ability of LLMs to accurately answer questions outside the scope of their training data. Unfortunately, this design introduces a critical vulnerability: LLMs may absorb and reproduce misinformation present in retrieved evidence. This problem is magnified if retrieved evidence contains adversarial material explicitly intended to promulgate misinformation. This paper presents a systematic evaluation of RAG robustness in the health domain and examines alignment between model outputs and ground-truth answers. We focus on the health domain due to the potential for harm caused by incorrect responses, as well as the availability of evidence-based ground truth for many common health-related questions. We conduct controlled experiments using common health questions, varying both the type and composition of the retrieved documents (helpful, harmful, and adversarial) as well as the framing of the question by the user (consistent, neutral, and inconsistent). Our findings reveal that adversarial documents substantially degrade alignment, but robustness can be preserved when helpful evidence is also present in the retrieval pool. These findings offer actionable insights for designing safer RAG systems in high-stakes domains by highlighting the need for retrieval safeguards. To enable reproducibility and facilitate future research, all experimental results are publicly available in our github repository. https://github.com/shakibaam/RAG_ROBUSTNESS_EVAL
中文: 检索增强生成系统通过提供检索证据来提高大语言模型的准确性,但容易受到检索文档中错误信息的影响,尤其在医疗等高危领域,对抗性内容会显著降低模型输出与事实的一致性。
English: Retrieval augmented generation (RAG) systems enhance LLM accuracy by providing contextual evidence but are vulnerable to misinformation in retrieved documents, particularly in high-stakes domains like healthcare where adversarial content can significantly reduce output alignment with truth.

Authors:Joseph Rich, Conrad Oakes, Lior Pachter
Title: Optimizing alluvial plots
Abstract:
Alluvial plots can be effective for visualization of multivariate data, but rely on ordering of alluvia that can be non-trivial to arrange. We formulate two optimization problems that formalize the challenge of ordering and coloring partitions in alluvial plots. While solving these optimization problems is challenging in general, we show that the NeighborNet algorithm from phylogenetics can be adapted to provide excellent results in typical use cases. Our methods are implemented in a freely available R package available on GitHub at https://github.com/pachterlab/wompwomp
中文摘要:冲积图在数据排序和着色方面存在难题,通过优化问题和改进的NeighborNet算法得以解决,并已集成于开源R软件包中。
English Summary: Alluvial plots face challenges in ordering and coloring data, which are addressed through optimization problems and an adapted NeighborNet algorithm, implemented in a freely available R package.

Authors:Zongsen Qiu
Title: STA-Net: A Decoupled Shape and Texture Attention Network for Lightweight Plant Disease Classification
Abstract:
Responding to rising global food security needs, precision agriculture and deep learning-based plant disease diagnosis have become crucial. Yet, deploying high-precision models on edge devices is challenging. Most lightweight networks use attention mechanisms designed for generic object recognition, which poorly capture subtle pathological features like irregular lesion shapes and complex textures. To overcome this, we propose a twofold solution: first, using a training-free neural architecture search method (DeepMAD) to create an efficient network backbone for edge devices; second, introducing the Shape-Texture Attention Module (STAM). STAM splits attention into two branches -- one using deformable convolutions (DCNv4) for shape awareness and the other using a Gabor filter bank for texture awareness. On the public CCMT plant disease dataset, our STA-Net model (with 401K parameters and 51.1M FLOPs) reached 89.00% accuracy and an F1 score of 88.96%. Ablation studies confirm STAM significantly improves performance over baseline and standard attention models. Integrating domain knowledge via decoupled attention thus presents a promising path for edge-deployed precision agriculture AI. The source code is available at https://github.com/RzMY/STA-Net.
中文: 针对现有注意力机制难以捕捉植物病害细微特征的问题,本研究提出STA-Net轻量级模型,通过无训练神经网络架构搜索主干和新型形状-纹理双分支注意力模块,在植物病害数据集上实现89.00%的准确率,为边缘设备上的精准农业应用提供了有效解决方案。
English: To address the limitations of generic attention mechanisms in capturing subtle plant disease features, this study introduces STA-Net, a lightweight model combining a training-free neural architecture search backbone with a novel Shape-Texture Attention Module that achieves 89.00% accuracy on plant disease diagnosis while being optimized for edge deployment.

Authors:Taha Koleilat, Hassan Rivaz, Yiming Xiao
Title: Singular Value Few-shot Adaptation of Vision-Language Models
Abstract:
Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.
Chinese: CLIP-SVD 提出了一种基于奇异值分解的参数高效适配方法,通过微调CLIP内部参数,在多个数据集上以极少的参数量实现了优异的准确性和泛化能力。
English: CLIP-SVD introduces a parameter-efficient adaptation method using Singular Value Decomposition to fine-tune CLIP's internal parameters, achieving superior accuracy and generalization with minimal parameter usage across multiple datasets.

Authors:Casper van Engelenburg, Jan van Gemert, Seyran Khademi
Title: LayoutGKN: Graph Similarity Learning of Floor Plans
Abstract:
Floor plans depict building layouts and are often represented as graphs to capture the underlying spatial relationships. Comparison of these graphs is critical for applications like search, clustering, and data visualization. The most successful methods to compare graphs \ie, graph matching networks, rely on costly intermediate cross-graph node-level interactions, therefore being slow in inference time. We introduce \textbf{LayoutGKN}, a more efficient approach that postpones the cross-graph node-level interactions to the end of the joint embedding architecture. We do so by using a differentiable graph kernel as a distance function on the final learned node-level embeddings. We show that LayoutGKN computes similarity comparably or better than graph matching networks while significantly increasing the speed. \href{https://github.com/caspervanengelenburg/LayoutGKN}{Code and data} are open.
中文:LayoutGKN是一种更高效的图匹配方法,它将跨图节点交互推迟到最终嵌入阶段,在显著提升推理速度的同时,实现了与现有网络相当或更优的相似性计算效果。
English: LayoutGKN is a more efficient graph matching method that delays cross-graph interactions until the final embedding stage, achieving comparable or better similarity accuracy while significantly speeding up inference compared to existing networks.

Authors:Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
Title: The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
Abstract:
Personality traits have long been studied as predictors of human behavior. Recent advances in Large Language Models (LLMs) suggest similar patterns may emerge in artificial systems, with advanced LLMs displaying consistent behavioral tendencies resembling human traits like agreeableness and self-regulation. Understanding these patterns is crucial, yet prior work primarily relied on simplified self-reports and heuristic prompting, with little behavioral validation. In this study, we systematically characterize LLM personality across three dimensions: (1) the dynamic emergence and evolution of trait profiles throughout training stages; (2) the predictive validity of self-reported traits in behavioral tasks; and (3) the impact of targeted interventions, such as persona injection, on both self-reports and behavior. Our findings reveal that instructional alignment (e.g., RLHF, instruction tuning) significantly stabilizes trait expression and strengthens trait correlations in ways that mirror human data. However, these self-reported traits do not reliably predict behavior, and observed associations often diverge from human patterns. While persona injection successfully steers self-reports in the intended direction, it exerts little or inconsistent effect on actual behavior. By distinguishing surface-level trait expression from behavioral consistency, our findings challenge assumptions about LLM personality and underscore the need for deeper evaluation in alignment and interpretability.
中文摘要:本研究系统分析了大语言模型的性格特征,发现虽然指令对齐能稳定类似人类的特质表达,但自我报告的特质无法可靠预测行为,且角色注入主要影响表面报告而非实际行为一致性。
English Summary: This study systematically examines LLM personality traits, revealing that while instructional alignment stabilizes trait expression similar to humans, self-reported traits fail to reliably predict behavior and persona injections primarily affect surface-level reports rather than actual behavioral consistency.

Authors:Seth Z. Zhao, Huizhi Zhang, Zhaowei Li, Juntong Peng, Anthony Chui, Zewei Zhou, Zonglin Meng, Hao Xiang, Zhiyu Huang, Fujia Wang, Ran Tian, Chenfeng Xu, Bolei Zhou, Jiaqi Ma
Title: QuantV2X: A Fully Quantized Multi-Agent System for Cooperative Perception
Abstract:
Cooperative perception through Vehicle-to-Everything (V2X) communication offers significant potential for enhancing vehicle perception by mitigating occlusions and expanding the field of view. However, past research has predominantly focused on improving accuracy metrics without addressing the crucial system-level considerations of efficiency, latency, and real-world deployability. Noticeably, most existing systems rely on full-precision models, which incur high computational and transmission costs, making them impractical for real-time operation in resource-constrained environments. In this paper, we introduce \textbf{QuantV2X}, the first fully quantized multi-agent system designed specifically for efficient and scalable deployment of multi-modal, multi-agent V2X cooperative perception. QuantV2X introduces a unified end-to-end quantization strategy across both neural network models and transmitted message representations that simultaneously reduces computational load and transmission bandwidth. Remarkably, despite operating under low-bit constraints, QuantV2X achieves accuracy comparable to full-precision systems. More importantly, when evaluated under deployment-oriented metrics, QuantV2X reduces system-level latency by 3.2$\times$ and achieves a +9.5 improvement in mAP30 over full-precision baselines. Furthermore, QuantV2X scales more effectively, enabling larger and more capable models to fit within strict memory budgets. These results highlight the viability of a fully quantized multi-agent intermediate fusion system for real-world deployment. The system will be publicly released to promote research in this field: https://github.com/ucla-mobility/QuantV2X.
中文: QuantV2X提出了首个完全量化的协同感知系统,通过统一量化策略在保持精度的同时大幅降低计算和传输开销,实现了3.2倍的延迟降低和更好的可扩展性,为V2X实际部署提供了可行方案。
English: QuantV2X introduces a fully quantized cooperative perception system that reduces computational and transmission costs while maintaining accuracy comparable to full-precision systems, achieving significant latency reduction and improved scalability for real-world V2X deployment.

Authors:Payam Abdisarabshali, Fardis Nadimi, Kasra Borazjani, Naji Khosravan, Minghui Liwang, Wei Ni, Dusit Niyato, Michael Langberg, Seyyedali Hosseinalipour
Title: Hierarchical Federated Foundation Models over Wireless Networks for Multi-Modal Multi-Task Intelligence: Integration of Edge Learning with D2D/P2P-Enabled Fog Learning Architectures
Abstract:
The rise of foundation models (FMs) has reshaped the landscape of machine learning. As these models continued to grow, leveraging geo-distributed data from wireless devices has become increasingly critical, giving rise to federated foundation models (FFMs). More recently, FMs have evolved into multi-modal multi-task (M3T) FMs (e.g., GPT-4) capable of processing diverse modalities across multiple tasks, which motivates a new underexplored paradigm: M3T FFMs. In this paper, we unveil an unexplored variation of M3T FFMs by proposing hierarchical federated foundation models (HF-FMs), which in turn expose two overlooked heterogeneity dimensions to fog/edge networks that have a direct impact on these emerging models: (i) heterogeneity in collected modalities and (ii) heterogeneity in executed tasks across fog/edge nodes. HF-FMs strategically align the modular structure of M3T FMs, comprising modality encoders, prompts, mixture-of-experts (MoEs), adapters, and task heads, with the hierarchical nature of fog/edge infrastructures. Moreover, HF-FMs enable the optional usage of device-to-device (D2D) communications, enabling horizontal module relaying and localized cooperative training among nodes when feasible. Through delving into the architectural design of HF-FMs, we highlight their unique capabilities along with a series of tailored future research directions. Finally, to demonstrate their potential, we prototype HF-FMs in a wireless network setting and release the open-source code for the development of HF-FMs with the goal of fostering exploration in this untapped field (GitHub: https://github.com/payamsiabd/M3T-FFM).
中文: 本文提出分层联邦基础模型(HF-FMs),通过将多模态多任务基础模型与雾计算/边缘网络层级对齐,解决模态和任务异质性,同时支持设备间通信和本地化协同训练。
English: The paper introduces hierarchical federated foundation models (HF-FMs), a novel paradigm that aligns multi-modal multi-task foundation models with fog/edge network hierarchies to address modality and task heterogeneity while enabling device-to-device communication and localized training.

Authors:Thomas R. Harvey
Title: The Optimiser Hidden in Plain Sight: Training with the Loss Landscape's Induced Metric
Abstract:
We present a class of novel optimisers for training neural networks that makes use of the Riemannian metric naturally induced when the loss landscape is embedded in higher-dimensional space. This is the same metric that underlies common visualisations of loss landscapes. By taking this geometric perspective literally and using the induced metric, we develop a new optimiser and compare it to existing methods, namely: SGD, Adam, AdamW, and Muon, across a range of tasks and architectures. Empirically, we conclude that this new class of optimisers is highly effective in low dimensional examples, and provides slight improvement over state-of-the-art methods for training neural networks. These new optimisers have theoretically desirable properties. In particular, the effective learning rate is automatically decreased in regions of high curvature acting as a smoothed out form of gradient clipping. Similarly, one variant of these optimisers can also be viewed as inducing an effective scheduled learning rate and decoupled weight decay is the natural choice from our geometric perspective. The basic method can be used to modify any existing preconditioning method. The new optimiser has a computational complexity comparable to that of Adam.
Chinese Summary: 本文提出了一类新颖的神经网络优化器,利用损失景观嵌入高维空间时自然诱导的黎曼度量,在低维示例中表现优异,相比现有最优方法略有提升,并具有理论优势如自适应学习率和解耦权重衰减。
English Summary: This paper introduces a novel class of optimizers for neural networks that leverage the Riemannian metric from embedding loss landscapes in higher dimensions, showing effectiveness in low-dimensional cases and slight improvements over state-of-the-art methods with desirable theoretical properties like adaptive learning rates.

Authors:Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang
Title: SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models
Abstract:
Proteins play crucial roles in almost all biological processes. The advancement of deep learning has greatly accelerated the development of protein foundation models, leading to significant successes in protein understanding and design. However, the lack of systematic red-teaming for these models has raised serious concerns about their potential misuse, such as generating proteins with biological safety risks. This paper introduces SafeProtein, the first red-teaming framework designed for protein foundation models to the best of our knowledge. SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red-teaming methods and conduct tests on protein foundation models. We also curated SafeProtein-Bench, which includes a manually constructed red-teaming benchmark dataset and a comprehensive evaluation protocol. SafeProtein achieved continuous jailbreaks on state-of-the-art protein foundation models (up to 70% attack success rate for ESM3), revealing potential biological safety risks in current protein foundation models and providing insights for the development of robust security protection technologies for frontier models. The codes will be made publicly available at https://github.com/jigang-fan/SafeProtein.
中文:本文提出了首个蛋白质基础模型红队测试框架SafeProtein,通过多模态提示工程和启发式束搜索方法,在先进模型上实现了高达70%的攻击成功率,揭示了当前蛋白质基础模型存在的生物安全风险。
English: This paper introduces SafeProtein, the first red-teaming framework for protein foundation models, which successfully exposed biological safety risks by achieving up to 70% attack success rates on state-of-the-art models through multimodal prompt engineering and heuristic beam search.

Authors:Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang
Title: SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models
Abstract:
Proteins play crucial roles in almost all biological processes. The advancement of deep learning has greatly accelerated the development of protein foundation models, leading to significant successes in protein understanding and design. However, the lack of systematic red-teaming for these models has raised serious concerns about their potential misuse, such as generating proteins with biological safety risks. This paper introduces SafeProtein, the first red-teaming framework designed for protein foundation models to the best of our knowledge. SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red-teaming methods and conduct tests on protein foundation models. We also curated SafeProtein-Bench, which includes a manually constructed red-teaming benchmark dataset and a comprehensive evaluation protocol. SafeProtein achieved continuous jailbreaks on state-of-the-art protein foundation models (up to 70% attack success rate for ESM3), revealing potential biological safety risks in current protein foundation models and providing insights for the development of robust security protection technologies for frontier models. The codes will be made publicly available at https://github.com/jigang-fan/SafeProtein.
中文:本文提出了首个蛋白质基础模型红队测试框架SafeProtein,通过多模态提示工程和启发式束搜索方法,在先进模型上实现了高达70%的攻击成功率,揭示了当前蛋白质基础模型存在的生物安全风险。
English: This paper introduces SafeProtein, the first red-teaming framework for protein foundation models, which successfully exposed biological safety risks by achieving up to 70% attack success rates on state-of-the-art models through multimodal prompt engineering and heuristic beam search.

Authors:Spyros Rigas, Dhruv Verma, Georgios Alexandridis, Yixuan Wang
Title: Initialization Schemes for Kolmogorov-Arnold Networks: An Empirical Study
Abstract:
Kolmogorov-Arnold Networks (KANs) are a recently introduced neural architecture that replace fixed nonlinearities with trainable activation functions, offering enhanced flexibility and interpretability. While KANs have been applied successfully across scientific and machine learning tasks, their initialization strategies remain largely unexplored. In this work, we study initialization schemes for spline-based KANs, proposing two theory-driven approaches inspired by LeCun and Glorot, as well as an empirical power-law family with tunable exponents. Our evaluation combines large-scale grid searches on function fitting and forward PDE benchmarks, an analysis of training dynamics through the lens of the Neural Tangent Kernel, and evaluations on a subset of the Feynman dataset. Our findings indicate that the Glorot-inspired initialization significantly outperforms the baseline in parameter-rich models, while power-law initialization achieves the strongest performance overall, both across tasks and for architectures of varying size. All code and data accompanying this manuscript are publicly available at https://github.com/srigas/KAN_Initialization_Schemes.
Chinese: 本研究探索了Kolmogorov-Arnold网络的初始化策略,发现幂律初始化在各类任务和模型规模中表现最优,而Glorot启发式方法在参数丰富的模型中表现突出。
English: This study explores initialization strategies for Kolmogorov-Arnold Networks, finding that power-law initialization delivers superior performance across various tasks and model sizes, while Glorot-inspired methods excel in parameter-rich models.

Authors:Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal
Title: Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has emerged to be a predominant paradigm for mathematical reasoning tasks, offering stable improvements in reasoning ability. However, Outcome Reward Models (ORMs) in RLVR are too coarse-grained to distinguish flawed reasoning within correct answers or valid reasoning within incorrect answers. This lack of granularity introduces noisy and misleading gradients significantly and hinders further progress in reasoning process quality. While Process Reward Models (PRMs) offer fine-grained guidance for intermediate steps, they frequently suffer from inaccuracies and are susceptible to reward hacking. To resolve this dilemma, we introduce PRocess cOnsistency Filter (PROF), an effective data process curation method that harmonizes noisy, fine-grained process rewards with accurate, coarse-grained outcome rewards. Rather than naively blending PRM and ORM in the objective function (arXiv:archive/2506.18896), PROF leverages their complementary strengths through consistency-driven sample selection. Our approach retains correct responses with higher averaged process values and incorrect responses with lower averaged process values, while maintaining positive/negative training sample balance. Extensive experiments demonstrate that our method not only consistently improves the final accuracy over $4\%$ compared to the blending approaches, but also strengthens the quality of intermediate reasoning steps. Codes and training recipes are available at https://github.com/Chenluye99/PROF.
中文摘要:本文提出PROF方法,通过一致性驱动的样本选择协调细粒度过程奖励与粗粒度结果奖励,在提升数学推理最终准确率的同时强化中间推理步骤的质量。
English Summary: The paper introduces PROF, a method that combines fine-grained process rewards and coarse-grained outcome rewards through consistency-driven sample selection to enhance mathematical reasoning by improving both final accuracy and intermediate step quality.

Authors:Reina Ishikawa, Ryo Fujii, Hideo Saito, Ryo Hachiuma
Title: Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
Abstract:
Evaluating concept customization is challenging, as it requires a comprehensive assessment of fidelity to generative prompts and concept images. Moreover, evaluating multiple concepts is considerably more difficult than evaluating a single concept, as it demands detailed assessment not only for each individual concept but also for the interactions among concepts. While humans can intuitively assess generated images, existing metrics often provide either overly narrow or overly generalized evaluations, resulting in misalignment with human preference. To address this, we propose Decomposed GPT Score (D-GPTScore), a novel human-aligned evaluation method that decomposes evaluation criteria into finer aspects and incorporates aspect-wise assessments using Multimodal Large Language Model (MLLM). Additionally, we release Human Preference-Aligned Concept Customization Benchmark (CC-AlignBench), a benchmark dataset containing both single- and multi-concept tasks, enabling stage-wise evaluation across a wide difficulty range -- from individual actions to multi-person interactions. Our method significantly outperforms existing approaches on this benchmark, exhibiting higher correlation with human preferences. This work establishes a new standard for evaluating concept customization and highlights key challenges for future research. The benchmark and associated materials are available at https://github.com/ReinaIshikawa/D-GPTScore.
中文: 本文提出了D-GPTScore方法,通过多模态大语言模型将评估标准分解为更细粒度,并发布了包含单概念与多概念任务的CC-AlignBench基准,在与人偏好的对齐度上显著优于现有方法。
English: This paper introduces D-GPTScore, a human-aligned evaluation method that decomposes assessment criteria into finer aspects using MLLM, and releases CC-AlignBench, a benchmark for single- and multi-concept tasks, achieving superior correlation with human preferences.

Authors:Yiyang Huang, Zixuan Wang, Zishen Wan, Yapeng Tian, Haobo Xu, Yinhe Han, Yiming Gan
Title: ANNIE: Be Careful of Your Robots
Abstract:
The integration of vision-language-action (VLA) models into embodied AI (EAI) robots is rapidly advancing their ability to perform complex, long-horizon tasks in humancentric environments. However, EAI systems introduce critical security risks: a compromised VLA model can directly translate adversarial perturbations on sensory input into unsafe physical actions. Traditional safety definitions and methodologies from the machine learning community are no longer sufficient. EAI systems raise new questions, such as what constitutes safety, how to measure it, and how to design effective attack and defense mechanisms in physically grounded, interactive settings. In this work, we present the first systematic study of adversarial safety attacks on embodied AI systems, grounded in ISO standards for human-robot interactions. We (1) formalize a principled taxonomy of safety violations (critical, dangerous, risky) based on physical constraints such as separation distance, velocity, and collision boundaries; (2) introduce ANNIEBench, a benchmark of nine safety-critical scenarios with 2,400 video-action sequences for evaluating embodied safety; and (3) ANNIE-Attack, a task-aware adversarial framework with an attack leader model that decomposes long-horizon goals into frame-level perturbations. Our evaluation across representative EAI models shows attack success rates exceeding 50% across all safety categories. We further demonstrate sparse and adaptive attack strategies and validate the real-world impact through physical robot experiments. These results expose a previously underexplored but highly consequential attack surface in embodied AI systems, highlighting the urgent need for security-driven defenses in the physical AI era. Code is available at https://github.com/RLCLab/Annie.
中文摘要:本研究首次系统性地探究具身AI系统的对抗性安全攻击,提出了基于物理约束的安全违规分类法、包含2400个视频动作序列的安全评估基准,以及任务感知的攻击框架,在各类安全场景中攻击成功率超过50%,揭示了物理AI时代亟待解决的安全漏洞。
English Summary: This study presents the first systematic investigation of adversarial safety attacks on embodied AI systems, introducing a taxonomy of safety violations, a benchmark for evaluation, and an attack framework that achieves over 50% success rate across safety categories, revealing critical security vulnerabilities in physically-grounded AI.

Authors:Hui Chen, Liangyu Liu, Xianchao Xiu, Wanquan Liu
Title: Transformer-Guided Content-Adaptive Graph Learning for Hyperspectral Unmixing
Abstract:
Hyperspectral unmixing (HU) targets to decompose each mixed pixel in remote sensing images into a set of endmembers and their corresponding abundances. Despite significant progress in this field using deep learning, most methods fail to simultaneously characterize global dependencies and local consistency, making it difficult to preserve both long-range interactions and boundary details. This letter proposes a novel transformer-guided content-adaptive graph unmixing framework (T-CAGU), which overcomes these challenges by employing a transformer to capture global dependencies and introducing a content-adaptive graph neural network to enhance local relationships. Unlike previous work, T-CAGU integrates multiple propagation orders to dynamically learn the graph structure, ensuring robustness against noise. Furthermore, T-CAGU leverages a graph residual mechanism to preserve global information and stabilize training. Experimental results demonstrate its superiority over the state-of-the-art methods. Our code is available at https://github.com/xianchaoxiu/T-CAGU.
中文: 该摘要提出T-CAGU新型高光谱解混框架,通过结合Transformer捕获全局依赖与自适应图网络增强局部关联,在性能上超越了现有先进方法。
English: This abstract introduces T-CAGU, a novel hyperspectral unmixing framework that combines transformers for global dependencies and adaptive graph networks for local consistency, achieving superior performance over existing methods.

Authors:Yixiong Jing, Cheng Zhang, Haibing Wu, Guangming Wang, Olaf Wysocki, Brian Sheil
Title: InfraDiffusion: zero-shot depth map restoration with diffusion models and prompted segmentation from sparse infrastructure point clouds
Abstract:
Point clouds are widely used for infrastructure monitoring by providing geometric information, where segmentation is required for downstream tasks such as defect detection. Existing research has automated semantic segmentation of structural components, while brick-level segmentation (identifying defects such as spalling and mortar loss) has been primarily conducted from RGB images. However, acquiring high-resolution images is impractical in low-light environments like masonry tunnels. Point clouds, though robust to dim lighting, are typically unstructured, sparse, and noisy, limiting fine-grained segmentation. We present InfraDiffusion, a zero-shot framework that projects masonry point clouds into depth maps using virtual cameras and restores them by adapting the Denoising Diffusion Null-space Model (DDNM). Without task-specific training, InfraDiffusion enhances visual clarity and geometric consistency of depth maps. Experiments on masonry bridge and tunnel point cloud datasets show significant improvements in brick-level segmentation using the Segment Anything Model (SAM), underscoring its potential for automated inspection of masonry assets. Our code and data is available at https://github.com/Jingyixiong/InfraDiffusion-official-implement.
Chinese Summary: InfraDiffusion是一种零样本框架,通过虚拟相机和扩散模型将砖石点云转换为增强的深度图,无需任务特定训练即可显著提升砖块级分割效果,实现基础设施自动化检测。
English Summary: InfraDiffusion is a zero-shot framework that converts masonry point clouds into enhanced depth maps using virtual cameras and diffusion models, significantly improving brick-level segmentation for automated infrastructure inspection without task-specific training.

Authors:Evgenii Kniazev, Arseny Kravchenko, Igor Rekun, James Broadhead, Nikita Shamgunov, Pranav Sah, Pratik Nichite, Ivan Yamshchikov
Title: app.build: A Production Framework for Scaling Agentic Prompt-to-App Generation with Environment Scaffolding
Abstract:
We present app.build (https://github.com/appdotbuild/agent/), an open-source framework that improves LLM-based application generation through systematic validation and structured environments. Our approach combines multi-layered validation pipelines, stack-specific orchestration, and model-agnostic architecture, implemented across three reference stacks. Through evaluation on 30 generation tasks, we demonstrate that comprehensive validation achieves 73.3% viability rate with 30% reaching perfect quality scores, while open-weights models achieve 80.8% of closed-model performance when provided structured environments. The open-source framework has been adopted by the community, with over 3,000 applications generated to date. This work demonstrates that scaling reliable AI agents requires scaling environments, not just models -- providing empirical insights and complete reference implementations for production-oriented agent systems.
中文:app.build框架通过系统验证和结构化环境提升基于LLM的应用程序生成效果,在开源社区中已实现广泛应用并验证了可靠智能体需扩展环境而不仅是模型的核心观点。
English: The app.build framework enhances LLM-based application generation via systematic validation and structured environments, achieving high viability and performance with open-source adoption.

Authors:Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min
Title: RTGMFF: Enhanced fMRI-based Brain Disorder Diagnosis via ROI-driven Text Generation and Multimodal Feature Fusion
Abstract:
Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation deterministically condenses each subject's activation, connectivity, age, and sex into reproducible text tokens; (ii) Hybrid frequency-spatial encoder fuses a hierarchical wavelet-mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) Adaptive semantic alignment module embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at https://github.com/BeistMedAI/RTGMFF.
中文: RTGMFF框架通过从fMRI数据生成ROI级文本描述,并利用混合频空编码器和自适应语义对齐整合多模态特征,在ADHD-200和ABIDE基准测试中显著提升了脑部疾病诊断的准确性。
English: The RTGMFF framework enhances brain-disorder diagnosis by generating ROI-level text descriptions from fMRI data and integrating multimodal features through a hybrid frequency-spatial encoder and adaptive semantic alignment, demonstrating superior accuracy on ADHD-200 and ABIDE benchmarks.

Authors:Yuchen Yang, Yiming Li, Hongwei Yao, Enhao Huang, Shuo Shao, Bingrun Yang, Zhibo Wang, Dacheng Tao, Zhan Qin
Title: PromptCOS: Towards System Prompt Copyright Auditing for LLMs via Content-level Output Similarity
Abstract:
The rapid progress of large language models (LLMs) has greatly enhanced reasoning tasks and facilitated the development of LLM-based applications. A critical factor in improving LLM-based applications is the design of effective system prompts, which significantly impact the behavior and output quality of LLMs. However, system prompts are susceptible to theft and misuse, which could undermine the interests of prompt owners. Existing methods protect prompt copyrights through watermark injection and verification but face challenges due to their reliance on intermediate LLM outputs (e.g., logits), which limits their practical feasibility. In this paper, we propose PromptCOS, a method for auditing prompt copyright based on content-level output similarity. It embeds watermarks by optimizing the prompt while simultaneously co-optimizing a special verification query and content-level signal marks. This is achieved by leveraging cyclic output signals and injecting auxiliary tokens to ensure reliable auditing in content-only scenarios. Additionally, it incorporates cover tokens to protect the watermark from malicious deletion. For copyright verification, PromptCOS identifies unauthorized usage by comparing the similarity between the suspicious output and the signal mark. Experimental results demonstrate that our method achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% greater than the best baseline), high fidelity (accuracy degradation of no more than 0.58%), robustness (resilience against three types of potential attacks), and computational efficiency (up to 98.1% reduction in computational cost). Our code is available at GitHub https://github.com/LianPing-cyber/PromptCOS.
中文: 大语言模型的快速发展增强了对系统提示词保护的需求,PromptCOS方法通过优化提示和信号标记嵌入水印进行版权验证,确保了高效性、鲁棒性和实用性。
English: The rapid advancement of large language models has increased the need for protecting system prompts from theft, leading to the development of PromptCOS, a method that embeds watermarks for copyright verification by optimizing prompts and signal marks to ensure high effectiveness, robustness, and efficiency.

Authors:Xingyue Huang, Rishabh, Gregor Franke, Ziyi Yang, Jiamu Bai, Weijie Bai, Jinhe Bi, Zifeng Ding, Yiqun Duan, Chengyu Fan, Wendong Fan, Xin Gao, Ruohao Guo, Yuan He, Zhuangzhuang He, Xianglong Hu, Neil Johnson, Bowen Li, Fangru Lin, Siyu Lin, Tong Liu, Yunpu Ma, Hao Shen, Hao Sun, Beibei Wang, Fangyijie Wang, Hao Wang, Haoran Wang, Yang Wang, Yifeng Wang, Zhaowei Wang, Ziyang Wang, Yifan Wu, Zikai Xiao, Chengxing Xie, Fan Yang, Junxiao Yang, Qianshuo Ye, Ziyu Ye, Guangtao Zeng, Yuwen Ebony Zhang, Zeyu Zhang, Zihao Zhu, Bernard Ghanem, Philip Torr, Guohao Li
Title: Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers
Abstract:
Recent advances in Large Language Models (LLMs) have shown that their reasoning capabilities can be significantly improved through Reinforcement Learning with Verifiable Reward (RLVR), particularly in domains like mathematics and programming, where ground-truth correctness can be automatically evaluated. However, extending this success to other reasoning-intensive domains remains challenging due to the scarcity of high-quality, verifiable datasets and the high cost of human supervision. In this work, we introduce the Loong Project: an open-source framework for scalable synthetic data generation and verification across a diverse range of reasoning-intensive domains. The framework consists of two key components: (1) LoongBench, a curated seed dataset containing 8,729 human-vetted examples across 12 domains (e.g., Advanced Mathematics, Chemistry, Logic), each paired with executable code and rich metadata; and (2) LoongEnv, a modular synthetic data generation environment that supports multiple prompting strategies to produce new question-answer-code triples. Together, these components form an agent-environment loop that enables reinforcement learning, where an LLM-based agent is rewarded for generating Chain-of-Thought (CoT) solutions that align with code-executed answers. Empirically, we benchmark LoongBench on a broad suite of both open-source and proprietary LLMs to evaluate domain coverage and reveal performance bottlenecks. In addition, we conduct a comprehensive analysis of synthetic data generated by LoongEnv, examining correctness, difficulty, and diversity. Code and documentation are available at https://github.com/camel-ai/loong.
中文: Loong项目推出了一个开源框架,通过LoongBench精选数据集和LoongEnv合成数据生成环境,在多样化推理领域实现可扩展的数据生成与验证,解决了大语言模型在数学和编程之外领域扩展推理能力的挑战。
English: The Loong Project introduces an open-source framework for scalable synthetic data generation and verification across diverse reasoning domains, addressing the challenge of extending LLM reasoning capabilities beyond mathematics and programming through its components LoongBench and LoongEnv.

Authors:Zhenhua Xu, Meng Han, Wenpeng Xing
Title: EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint
Abstract:
The proliferation of large language models (LLMs) has intensified concerns over model theft and license violations, necessitating robust and stealthy ownership verification. Existing fingerprinting methods either require impractical white-box access or introduce detectable statistical anomalies. We propose EverTracer, a novel gray-box fingerprinting framework that ensures stealthy and robust model provenance tracing. EverTracer is the first to repurpose Membership Inference Attacks (MIAs) for defensive use, embedding ownership signals via memorization instead of artificial trigger-output overfitting. It consists of Fingerprint Injection, which fine-tunes the model on any natural language data without detectable artifacts, and Verification, which leverages calibrated probability variation signal to distinguish fingerprinted models. This approach remains robust against adaptive adversaries, including input level modification, and model-level modifications. Extensive experiments across architectures demonstrate EverTracer's state-of-the-art effectiveness, stealthness, and resilience, establishing it as a practical solution for securing LLM intellectual property. Our code and data are publicly available at https://github.com/Xuzhenhua55/EverTracer.
中文摘要:EverTracer是一种创新的灰盒指纹框架,通过重新利用成员推理攻击在大型语言模型中嵌入隐蔽的所有权信号,无需可检测的人工痕迹即可实现稳健的知识产权保护。
English Summary: EverTracer is a novel gray-box fingerprinting framework that repurposes Membership Inference Attacks to embed stealthy ownership signals in large language models, ensuring robust intellectual property protection without detectable artifacts.

Authors:Xinzhe Zheng, Zhen-Qun Yang, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin
Title: Binary Quantization For LLMs Through Dynamic Grouping
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of Natural Language Processing (NLP) tasks, but require substantial memory and computational resources. Binary quantization, which compresses model weights from 16-bit Brain Float to 1-bit representations in {-1, 1}, offers significant reductions in storage and inference costs. However, such aggressive quantization often leads to notable performance degradation compared to more conservative 4-bit quantization methods. In this research, we propose a novel optimization objective tailored for binary quantization, along with three algorithms designed to realize it effectively. Our method enhances blocked quantization by dynamically identifying optimal unstructured sub-matrices through adaptive grouping strategies. Experimental results demonstrate that our approach achieves an average bit length of just 1.007 bits, while maintaining high model quality. Specifically, our quantized LLaMA 3.2 3B model attains a perplexity of 8.23, remarkably close to the original 7.81, and surpasses previous SOTA BiLLM with a perplexity of only 123.90. Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ in both performance and efficiency. The compression process is highly efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights on a single CPU core, with the entire process completing in under 100 minutes and exhibiting embarrassingly parallel properties. Code - https://github.com/johnnyzheng0636/WGM_bi_quan
中文: 本研究提出了一种创新的二值量化方法,将大语言模型压缩至平均1.007比特的同时保持优异性能,其困惑度接近原始模型并超越现有最优方法,且具备高效并行处理能力。
English: This research introduces a novel binary quantization method that reduces large language models to an average of 1.007 bits while maintaining high performance, achieving perplexity scores close to original models and surpassing previous state-of-the-art approaches with efficient parallel processing.

Authors:Tzuhsuan Huang, Cheng Yu Yeo, Tsai-Ling Huang, Hong-Han Shuai, Wen-Huang Cheng, Jun-Cheng Chen
Title: Enhancing Robustness in Post-Processing Watermarking: An Ensemble Attack Network Using CNNs and Transformers
Abstract:
Recent studies on deep watermarking have predominantly focused on in-processing watermarking, which integrates the watermarking process into image generation. However, post-processing watermarking, which embeds watermarks after image generation, offers more flexibility. It can be applied to outputs from any generative model (e.g. GANs, diffusion models) without needing access to the model's internal structure. It also allows users to embed unique watermarks into individual images. Therefore, this study focuses on post-processing watermarking and enhances its robustness by incorporating an ensemble attack network during training. We construct various versions of attack networks using CNN and Transformer in both spatial and frequency domains to investigate how each combination influences the robustness of the watermarking model. Our results demonstrate that combining a CNN-based attack network in the spatial domain with a Transformer-based attack network in the frequency domain yields the highest robustness in watermarking models. Extensive evaluation on the WAVES benchmark, using average bit accuracy as the metric, demonstrates that our ensemble attack network significantly enhances the robustness of baseline watermarking methods under various stress tests. In particular, for the Regeneration Attack defined in WAVES, our method improves StegaStamp by 18.743%. The code is released at:https://github.com/aiiu-lab/DeepRobustWatermark.
中文摘要:本研究通过训练中引入集成攻击网络,结合空间域的CNN和频域的Transformer模型,显著增强了后处理水印的鲁棒性,尤其在WAVES基准测试中将StegaStamp对再生攻击的抵抗能力提升了18.743%。
English Summary: This study advances post-processing watermarking by integrating an ensemble attack network during training, combining CNN and Transformer models across spatial and frequency domains to significantly boost watermark robustness, particularly improving StegaStamp by 18.743% against regeneration attacks.

Authors:Shuai Jiang, Yunfeng Ma, Jingyu Zhou, Yuan Bian, Yaonan Wang, Min Liu
Title: Resilient Multimodal Industrial Surface Defect Detection with Uncertain Sensors Availability
Abstract:
Multimodal industrial surface defect detection (MISDD) aims to identify and locate defect in industrial products by fusing RGB and 3D modalities. This article focuses on modality-missing problems caused by uncertain sensors availability in MISDD. In this context, the fusion of multiple modalities encounters several troubles, including learning mode transformation and information vacancy. To this end, we first propose cross-modal prompt learning, which includes: i) the cross-modal consistency prompt serves the establishment of information consistency of dual visual modalities; ii) the modality-specific prompt is inserted to adapt different input patterns; iii) the missing-aware prompt is attached to compensate for the information vacancy caused by dynamic modalities-missing. In addition, we propose symmetric contrastive learning, which utilizes text modality as a bridge for fusion of dual vision modalities. Specifically, a paired antithetical text prompt is designed to generate binary text semantics, and triple-modal contrastive pre-training is offered to accomplish multimodal learning. Experiment results show that our proposed method achieves 73.83% I-AUROC and 93.05% P-AUROC with a total missing rate 0.7 for RGB and 3D modalities (exceeding state-of-the-art methods 3.84% and 5.58% respectively), and outperforms existing approaches to varying degrees under different missing types and rates. The source code will be available at https://github.com/SvyJ/MISDD-MM.
中文: 本文提出了一种新颖的多模态工业表面缺陷检测方法,通过跨模态提示学习和对称对比学习解决模态缺失问题,在不同缺失类型和比率下均优于现有方法。
English: This article introduces a novel method for multimodal industrial surface defect detection that addresses modality-missing issues through cross-modal prompt learning and symmetric contrastive learning, achieving superior performance over existing approaches.

Authors:Zeyu Liu, Shengwei Ding
Title: STAR: A Fast and Robust Rigid Registration Framework for Serial Histopathological Images
Abstract:
Registration of serial whole-slide histopathological images (WSIs) is critical for enabling direct comparison across diverse stains and for preparing paired datasets in artificial intelligence (AI) workflows such as virtual staining and biomarker prediction. While existing methods often rely on complex deformable or deep learning approaches that are computationally intensive and difficult to reproduce, lightweight rigid frameworks-sufficient for many consecutive-section scenarios-remain underdeveloped. We introduce STAR (Serial Tissue Alignment for Rigid registration), a fast and robust open-source framework for multi-WSI alignment. STAR integrates stain-conditioned preprocessing with a hierarchical coarse-to-fine correlation strategy, adaptive kernel scaling, and built-in quality control, achieving reliable rigid registration across heterogeneous tissue types and staining protocols, including hematoxylin-eosin (H&E), special histochemical stains (e.g., PAS, PASM, Masson's), and immunohistochemical (IHC) markers (e.g., CD31, KI67). Evaluated on the ANHIR 2019 and ACROBAT 2022 datasets spanning multiple organs and scanning conditions, STAR consistently produced stable alignments within minutes per slide, demonstrating robustness to cross-stain variability and partial tissue overlap. Beyond benchmarks, we present case studies on H&E-IHC alignment, construction of multi-IHC panels, and typical failure modes, underscoring both utility and limitations. Released as an open and lightweight tool, STAR provides a reproducible baseline that lowers the barrier for clinical adoption and enables large-scale paired data preparation for next-generation computational pathology.
中文:STAR是一种快速、鲁棒的开源框架,用于连续全切片病理图像的刚性配准,它结合了特定预处理与分层相关策略,能在不同染色方案和组织类型间实现可靠对齐。
English: STAR is a fast, robust open-source framework for rigid registration of serial whole-slide histopathological images, integrating specialized preprocessing with hierarchical correlation to achieve reliable alignment across diverse stains and tissue types.

Authors:Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Susan Holm, Yuran Wang, Vincent Zhou, Ken Fukuda, Teruko Mitamura
Title: ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
Abstract:
Assistants on assembly tasks have a large potential to benefit humans from everyday tasks to industrial settings. However, no testbeds support application-oriented system evaluation in a practical setting, especially in assembly. To foster the development, we propose a new multimodal QA dataset on assembly activities. Our dataset, ProMQA-Assembly, consists of 391 QA pairs that require the multimodal understanding of human-activity recordings and their instruction manuals in an online-style manner. In the development, we adopt a semi-automated QA annotation approach, where LLMs generate candidates and humans verify them, as a cost-effective method, and further improve it by integrating fine-grained action labels to diversify question types. Furthermore, we create instruction task graphs for the target tasks of assembling toy vehicles. These newly created task graphs are used in our benchmarking experiment, as well as to facilitate the human verification process in the QA annotation. Utilizing our dataset, we benchmark models, including competitive proprietary multimodal models. Our results suggest great room for improvement for the current models. We believe our new evaluation dataset can contribute to the further development of procedural-activity assistants.
中文: 本文提出了ProMQA-Assembly多模态问答数据集,包含391个问答对,采用半自动标注方法评估实际装配任务中的AI助手,基准测试表明现有模型仍有很大改进空间。
English: This paper introduces ProMQA-Assembly, a multimodal QA dataset with 391 question-answer pairs designed to evaluate AI assistants in practical assembly tasks, using a semi-automated annotation method and benchmarking that reveals significant room for improvement in current models.

Authors:Armin Saadat, Nima Hashemi, Hooman Vaseli, Michael Y. Tsang, Christina Luong, Michiel Van de Panne, Teresa S. M. Tsang, Purang Abolmaesumi
Title: PRECISE-AS: Personalized Reinforcement Learning for Efficient Point-of-Care Echocardiography in Aortic Stenosis Diagnosis
Abstract:
Aortic stenosis (AS) is a life-threatening condition caused by a narrowing of the aortic valve, leading to impaired blood flow. Despite its high prevalence, access to echocardiography (echo), the gold-standard diagnostic tool, is often limited due to resource constraints, particularly in rural and underserved areas. Point-of-care ultrasound (POCUS) offers a more accessible alternative but is restricted by operator expertise and the challenge of selecting the most relevant imaging views. To address this, we propose a reinforcement learning (RL)-driven active video acquisition framework that dynamically selects each patient's most informative echo videos. Unlike traditional methods that rely on a fixed set of videos, our approach continuously evaluates whether additional imaging is needed, optimizing both accuracy and efficiency. Tested on data from 2,572 patients, our method achieves 80.6% classification accuracy while using only 47% of the echo videos compared to a full acquisition. These results demonstrate the potential of active feature acquisition to enhance AS diagnosis, making echocardiographic assessments more efficient, scalable, and personalized. Our source code is available at: https://github.com/Armin-Saadat/PRECISE-AS.
中文: 该研究提出强化学习驱动的动态采集框架,通过智能选择最具诊断价值的心脏超声视频,仅用47%的视频量即实现80.6%的主动脉狭窄分类准确率,显著提升了诊断效率与可及性。
English: The proposed reinforcement learning framework dynamically selects the most informative echocardiography videos for aortic stenosis diagnosis, achieving 80.6% accuracy while using only 47% of videos compared to full acquisition, thereby enhancing diagnostic efficiency and accessibility.

Authors:Harsh Muriki, Hong Ray Teo, Ved Sengupta, Ai-Ping Hu
Title: Robotic 3D Flower Pose Estimation for Small-Scale Urban Farms
Abstract:
The small scale of urban farms and the commercial availability of low-cost robots (such as the FarmBot) that automate simple tending tasks enable an accessible platform for plant phenotyping. We have used a FarmBot with a custom camera end-effector to estimate strawberry plant flower pose (for robotic pollination) from acquired 3D point cloud models. We describe a novel algorithm that translates individual occupancy grids along orthogonal axes of a point cloud to obtain 2D images corresponding to the six viewpoints. For each image, 2D object detection models for flowers are used to identify 2D bounding boxes which can be converted into the 3D space to extract flower point clouds. Pose estimation is performed by fitting three shapes (superellipsoids, paraboloids and planes) to the flower point clouds and compared with manually labeled ground truth. Our method successfully finds approximately 80% of flowers scanned using our customized FarmBot platform and has a mean flower pose error of 7.7 degrees, which is sufficient for robotic pollination and rivals previous results. All code will be made available at https://github.com/harshmuriki/flowerPose.git.
中文摘要:研究人员利用定制化FarmBot开发了一种新算法,用于自动化检测草莓花朵并估算其姿态以实现机器人授粉,达到了80%的检测准确率和7.7度的平均姿态误差。
English Summary: Researchers developed a novel algorithm using a customized FarmBot to automate strawberry flower detection and pose estimation for robotic pollination, achieving 80% detection accuracy with a mean pose error of 7.7 degrees.

Authors:Mennatullah Siam
Title: PixFoundation 2.0: Do Video Multi-Modal LLMs Use Motion in Visual Grounding?
Abstract:
Multi-modal large language models (MLLMs) have shown impressive generalization across tasks using images and text modalities. While their extension to video has enabled tasks such as video question answering and video captioning, their pixel-level visual grounding abilities are less studied. In this work, we raise the pertinent question of whether motion is used in pixel-level visual grounding and whether video MLLMs can segment objects based on natural language expressions describing their motion patterns. We identify the shortcomings in the current benchmarks, where we show that a single frame can often suffice for capturing the motion referring expression without any temporal reasoning. To address this, we introduce four motion-centric probing techniques, particularly designed for the visual grounding task, to study video MLLMs' ability to identify true motion from a fake one and their ability to grasp the motion order. Consequently, we provide a motion-centric benchmark, MoCentric-Bench. It ensures that video MLLMs are evaluated towards leveraging the interaction between motion and language rather than being dominated by static appearance cues emphasized in existing visual grounding datasets. We further establish strong single-image baselines that are on par with or outperform prior methods. Finally, we explore simple motion-centric adaptation techniques that provide state-of-the-art performance on our MoCentric-Bench. Our motion-centric benchmark, evaluation and findings challenge future models to improve dense spatiotemporal grounding and pixel-level understanding within videos. Code and datasets will be made publicly available at https://github.com/MSiam/PixFoundation-2.0.git.
中文: 本研究提出了以运动为中心的基准MoCentric-Bench,旨在评估视频多模态大语言模型利用运动线索进行像素级视觉定位的能力,通过四项运动探测技术验证模型对真实运动与运动顺序的理解,弥补现有数据集中静态特征主导的不足。
English: This study introduces MoCentric-Bench, a motion-centric benchmark designed to evaluate video multi-modal large language models' ability to perform pixel-level visual grounding by leveraging motion cues and temporal reasoning, addressing limitations in existing datasets that overemphasize static appearance.

Authors:Jie Xiao, Mengye Lyu, Shaojun Liu
Title: A Two-Stage Strategy for Mitosis Detection Using Improved YOLO11x Proposals and ConvNeXt Classification
Abstract:
MIDOG 2025 Track 1 requires mitosis detection in whole-slide images (WSIs) containing non-tumor, inflamed, and necrotic regions. Due to the complicated and heterogeneous context, as well as possible artifacts, there are often false positives and false negatives, thus degrading the detection F1-score. To address this problem, we propose a two-stage framework. Firstly, an improved YOLO11x, integrated with EMA attention and LSConv, is employed to generate mitosis candidates. We use a low confidence threshold to generate as many proposals as possible, ensuring the detection recall. Then, a ConvNeXt-Tiny classifier is employed to filter out the false positives, ensuring the detection precision. Consequently, the proposed two-stage framework can generate a high detection F1-score. Evaluated on a fused dataset comprising MIDOG++, MITOS_WSI_CCMCT, and MITOS_WSI_CMC, our framework achieves an F1-score of 0.882, which is 0.035 higher than the single-stage YOLO11x baseline. This performance gain is produced by a significant precision improvement, from 0.762 to 0.839, and a comparable recall. The code is available at https://github.com/xxiao0304/MIDOG-2025-Track-1-of-SZTU.
中文: 该研究提出了一种两阶段框架,结合改进的YOLO11x模型生成候选目标,并使用ConvNeXt-Tiny分类器过滤误检,在融合数据集上F1分数达0.882,在MIDOG 2025 Track 1测试集上达0.7587。
English: The study introduces a two-stage framework combining an enhanced YOLO11x model for candidate detection and a ConvNeXt-Tiny classifier to filter false positives, achieving a higher F1-score of 0.882 on a fused dataset and 0.7587 on the MIDOG 2025 Track 1 test set.

Authors:Jie Xiao, Mengye Lyu, Shaojun Liu
Title: A Two-Stage Strategy for Mitosis Detection Using Improved YOLO11x Proposals and ConvNeXt Classification
Abstract:
MIDOG 2025 Track 1 requires mitosis detection in whole-slideimages (WSIs) containing non-tumor, inflamed, and necrotic re-gions. Due to the complicated and heterogeneous context, aswell as possible artifacts, there are often false positives and falsenegatives, thus degrading the detection F1-score. To addressthis problem, we propose a two-stage framework. Firstly, an im-proved YOLO11x, integrated with EMA attention and LSConv,is employed to generate mitosis candidates. We use a low confi-dence threshold to generate as many proposals as possible, en-suring the detection recall. Then, a ConvNeXt-Tiny classifieris employed to filter out the false positives, ensuring the detec-tion precision. Consequently, the proposed two-stage frame-work can generate a high detection F1-score. Evaluated on afused dataset comprising MIDOG++, MITOS_WSI_CCMCT,and MITOS_WSI_CMC, our framework achieves an F1-scoreof 0.882, which is 0.035 higher than the single-stage YOLO11xbaseline. This performance gain is produced by a significantprecision improvement, from 0.762 to 0.839, and a comparablerecall. On the MIDOG 2025 Track 1 preliminary test set, thealgorithm scores an F1 score of 0.7587. The code is available athttps://github.com/xxiao0304/MIDOG-2025-Track-1-of-SZTU.
中文: 该研究提出了一种两阶段框架,结合改进的YOLO11x模型生成候选目标,并使用ConvNeXt-Tiny分类器过滤误检,在融合数据集上F1分数达0.882,在MIDOG 2025 Track 1测试集上达0.7587。
English: The study introduces a two-stage framework combining an enhanced YOLO11x model for candidate detection and a ConvNeXt-Tiny classifier to filter false positives, achieving a higher F1-score of 0.882 on a fused dataset and 0.7587 on the MIDOG 2025 Track 1 test set.

Authors:Xinrui Gong, Oliver Hahn, Christoph Reich, Krishnakant Singh, Simone Schaub-Meyer, Daniel Cremers, Stefan Roth
Title: Motion-Refined DINOSAUR for Unsupervised Multi-Object Discovery
Abstract:
Unsupervised multi-object discovery (MOD) aims to detect and localize distinct object instances in visual scenes without any form of human supervision. Recent approaches leverage object-centric learning (OCL) and motion cues from video to identify individual objects. However, these approaches use supervision to generate pseudo labels to train the OCL model. We address this limitation with MR-DINOSAUR -- Motion-Refined DINOSAUR -- a minimalistic unsupervised approach that extends the self-supervised pre-trained OCL model, DINOSAUR, to the task of unsupervised multi-object discovery. We generate high-quality unsupervised pseudo labels by retrieving video frames without camera motion for which we perform motion segmentation of unsupervised optical flow. We refine DINOSAUR's slot representations using these pseudo labels and train a slot deactivation module to assign slots to foreground and background. Despite its conceptual simplicity, MR-DINOSAUR achieves strong multi-object discovery results on the TRI-PD and KITTI datasets, outperforming the previous state of the art despite being fully unsupervised.
Chinese: MR-DINOSAUR是一种完全无监督的方法,通过运动生成的伪标签优化自监督的以物体为中心学习模型,在无需人工标注的情况下实现了最先进的多物体发现性能。
English: MR-DINOSAUR is a fully unsupervised approach that refines self-supervised object-centric learning models using motion-based pseudo labels to achieve state-of-the-art multi-object discovery without human supervision.

Authors:Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang, Wanwei He, Run Luo, Min Yang
Title: Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Abstract:
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming. RLVR leverages verifiable outcome rewards to guide policy optimization, enabling LLMs to progressively improve output quality in a grounded and reliable manner. Despite its promise, the RLVR paradigm poses significant challenges, as existing methods often suffer from sparse reward signals and unstable policy gradient updates, particularly in RL-based approaches. To address the challenges, we propose $\textbf{PACS}$, a novel RLVR framework that achieves im$\textbf{P}$licit $\textbf{A}$ctor $\textbf{C}$ritic coupling via a $\textbf{S}$upervised learning framework. By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss. A detailed gradient analysis shows that this supervised formulation inherently recovers the classical policy gradient update while implicitly coupling actor and critic roles, yielding more stable and efficient training. Benchmarking on challenging mathematical reasoning tasks, PACS outperforms strong RLVR baselines, such as PPO and GRPO, achieving superior reasoning performance. For instance, PACS achieves 59.78\% at pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO. This simple yet powerful framework offers a promising avenue for LLMs post-training with verifiable rewards. Our code and data are available as open source at https://github.com/ritzz-ai/PACS.
中文摘要:PACS框架通过监督学习方式重构可验证奖励的强化学习问题,隐式耦合行动者与评论者角色,在数学推理任务上实现了比传统方法更稳定的训练和更优异的性能表现。
English Summary: The PACS framework introduces a supervised learning approach to Reinforcement Learning with Verifiable Rewards, implicitly coupling actor and critic roles to achieve more stable training and superior performance on mathematical reasoning tasks compared to existing methods.

Authors:Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
Title: Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Abstract:
Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including temperature scaling, top-\$p\$ (nucleus) sampling, and min-\$p\$ sampling, aim to manage this trade-off. However, they exhibit limitations, particularly in the effective incorporation of the confidence of the model into the corresponding sampling strategy. For example, min-\$p\$ sampling relies on a single top token as a heuristic for confidence, eventually underutilizing the information of the probability distribution. Toward effective incorporation of the confidence of the model, in this paper, we present **top-H** decoding. We first establish the theoretical foundation of the interplay between creativity and coherence in truncated sampling by formulating an **entropy-constrained minimum divergence** problem. We then prove this minimization problem to be equivalent to an **entropy-constrained mass maximization** (ECMM) problem, which is NP-hard. Finally, we present top-H decoding, a computationally efficient greedy algorithm to solve the ECMM problem. Extensive empirical evaluations demonstrate that top-H outperforms the state-of-the-art (SoTA) alternative of min-\$p\$ sampling by up to **25.63%** on creative writing benchmarks, while maintaining robustness on question-answering datasets such as GPQA, GSM8K, and MT-Bench. Additionally, an *LLM-as-judge* evaluation confirms that top-H indeed produces coherent outputs even at higher temperatures, where creativity is especially critical. In summary, top-H advances SoTA in open-ended text generation and can be *easily integrated* into creative writing applications. The code is available at https://github.com/ErfanBaghaei/Top-H-Decoding.
中文摘要:大语言模型在开放文本生成中存在创造力与逻辑连贯性的平衡难题,而新提出的top-H解码方法通过有效整合模型置信度,在创意写作任务中比现有最佳方法提升高达25.63%的性能,同时保持问答任务的稳健性。
English Summary: Large language models face a trade-off between creativity and coherence in text generation, and the proposed top-H decoding method effectively incorporates model confidence to outperform existing techniques by up to 25.63% on creative writing tasks while maintaining robustness.

Authors:Nishant Tanksale, Tanmay Kokate, Darshan Gohad, Sarvadnyaa Barate, Raviraj Joshi
Title: L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages
Abstract:
Semantic evaluation in low-resource languages remains a major challenge in NLP. While sentence transformers have shown strong performance in high-resource settings, their effectiveness in Indic languages is underexplored due to a lack of high-quality benchmarks. To bridge this gap, we introduce L3Cube-IndicHeadline-ID, a curated headline identification dataset spanning ten low-resource Indic languages: Marathi, Hindi, Tamil, Gujarati, Odia, Kannada, Malayalam, Punjabi, Telugu, Bengali and English. Each language includes 20,000 news articles paired with four headline variants: the original, a semantically similar version, a lexically similar version, and an unrelated one, designed to test fine-grained semantic understanding. The task requires selecting the correct headline from the options using article-headline similarity. We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity. Results show that multilingual models consistently perform well, while language-specific models vary in effectiveness. Given the rising use of similarity models in Retrieval-Augmented Generation (RAG) pipelines, this dataset also serves as a valuable resource for evaluating and improving semantic understanding in such applications. Additionally, the dataset can be repurposed for multiple-choice question answering, headline classification, or other task-specific evaluations of LLMs, making it a versatile benchmark for Indic NLP. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp
中文摘要:本文针对低资源印度语言提出L3Cube-IndicHeadline-ID多语言数据集,通过新闻标题识别任务验证了多语言句子转换器相比语言特定模型具有更优的语义理解能力。
English Summary: This paper introduces L3Cube-IndicHeadline-ID, a multilingual dataset for evaluating semantic understanding in ten low-resource Indic languages, demonstrating that multilingual sentence transformers outperform language-specific models in headline identification tasks.

Authors:Junxi Wu, Jinpeng Wang, Zheng Liu, Bin Chen, Dongjian Hu, Hao Wu, Shu-Tao Xia
Title: MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds
Abstract:
The rapid advancement of large language models has intensified public concerns about the potential misuse. Therefore, it is important to build trustworthy AI-generated text detection systems. Existing methods neglect stylistic modeling and mostly rely on static thresholds, which greatly limits the detection performance. In this paper, we propose the Mixture of Stylistic Experts (MoSEs) framework that enables stylistics-aware uncertainty quantification through conditional threshold estimation. MoSEs contain three core components, namely, the Stylistics Reference Repository (SRR), the Stylistics-Aware Router (SAR), and the Conditional Threshold Estimator (CTE). For input text, SRR can activate the appropriate reference data in SRR and provide them to CTE. Subsequently, CTE jointly models the linguistic statistical properties and semantic features to dynamically determine the optimal threshold. With a discrimination score, MoSEs yields prediction labels with the corresponding confidence level. Our framework achieves an average improvement 11.34% in detection performance compared to baselines. More inspiringly, MoSEs shows a more evident improvement 39.15% in the low-resource case. Our code is available at https://github.com/creator-xi/MoSEs.
中文: 本文提出的混合风格专家框架通过条件阈值估计实现风格感知的不确定性量化,显著提升了AI生成文本的检测性能,平均比基线方法提高了11.34%。
English: This paper introduces the Mixture of Stylistic Experts (MoSEs) framework, which enhances AI-generated text detection by dynamically estimating thresholds based on stylistic modeling, achieving an 11.34% average performance improvement over baseline methods.

Authors:Aishwarya Sarkar, Autrin Hakimi, Xiaoqiong Chen, Hai Huang, Chaoqun Lu, Ibrahim Demir, Ali Jannesari
Title: HydroGAT: Distributed Heterogeneous Graph Attention Transformer for Spatiotemporal Flood Prediction
Abstract:
Accurate flood forecasting remains a challenge for water-resource management, as it demands modeling of local, time-varying runoff drivers (e.g., rainfall-induced peaks, baseflow trends) and complex spatial interactions across a river network. Traditional data-driven approaches, such as convolutional networks and sequence-based models, ignore topological information about the region. Graph Neural Networks (GNNs) propagate information exactly along the river network, which is ideal for learning hydrological routing. However, state-of-the-art GNN-based flood prediction models collapse pixels to coarse catchment polygons as the cost of training explodes with graph size and higher resolution. Furthermore, most existing methods treat spatial and temporal dependencies separately, either applying GNNs solely on spatial graphs or transformers purely on temporal sequences, thus failing to simultaneously capture spatiotemporal interactions critical for accurate flood prediction. We introduce a heterogenous basin graph where every land and river pixel is a node connected by physical hydrological flow directions and inter-catchment relationships. We propose HydroGAT, a spatiotemporal network that adaptively learns local temporal importance and the most influential upstream locations. Evaluated in two Midwestern US basins and across five baseline architectures, our model achieves higher NSE (up to 0.97), improved KGE (up to 0.96), and low bias (PBIAS within $\pm$5%) in hourly discharge prediction, while offering interpretable attention maps that reveal sparse, structured intercatchment influences. To support high-resolution basin-scale training, we develop a distributed data-parallel pipeline that scales efficiently up to 64 NVIDIA A100 GPUs on NERSC Perlmutter supercomputer, demonstrating up to 15x speedup across machines. Our code is available at https://github.com/swapp-lab/HydroGAT.
中文: HydroGAT通过构建异构流域图并开发时空网络,有效捕捉水文交互作用,在洪水预测中实现了更高精度和可解释性,同时支持可扩展的高分辨率模型训练。
English: HydroGAT introduces a heterogeneous basin graph and a spatiotemporal network that effectively captures hydrological interactions, achieving superior accuracy and interpretability in flood forecasting while enabling scalable high-resolution training.

Authors:Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, Kai Yuan
Title: Unifi3D: A Study on 3D Representations for Generation and Reconstruction in a Common Framework
Abstract:
Following rapid advancements in text and image generation, research has increasingly shifted towards 3D generation. Unlike the well-established pixel-based representation in images, 3D representations remain diverse and fragmented, encompassing a wide variety of approaches such as voxel grids, neural radiance fields, signed distance functions, point clouds, or octrees, each offering distinct advantages and limitations. In this work, we present a unified evaluation framework designed to assess the performance of 3D representations in reconstruction and generation. We compare these representations based on multiple criteria: quality, computational efficiency, and generalization performance. Beyond standard model benchmarking, our experiments aim to derive best practices over all steps involved in the 3D generation pipeline, including preprocessing, mesh reconstruction, compression with autoencoders, and generation. Our findings highlight that reconstruction errors significantly impact overall performance, underscoring the need to evaluate generation and reconstruction jointly. We provide insights that can inform the selection of suitable 3D models for various applications, facilitating the development of more robust and application-specific solutions in 3D generation. The code for our framework is available at https://github.com/isl-org/unifi3d.
中文: 本研究提出一个统一的评估框架,用于比较各种三维表示在重建和生成中的表现,强调联合评估对优化质量、效率和特定应用需求的性能至关重要。
English: This study introduces a unified evaluation framework to compare diverse 3D representations in reconstruction and generation, emphasizing that joint assessment is crucial for optimizing performance across quality, efficiency, and application-specific needs.

Authors:Lingzhi Shen, Xiaohao Cai, Yunfei Long, Imran Razzak, Guanming Chen, Shoaib Jameel
Title: EmoPerso: Enhancing Personality Detection with Self-Supervised Emotion-Aware Modelling
Abstract:
Personality detection from text is commonly performed by analysing users' social media posts. However, existing methods heavily rely on large-scale annotated datasets, making it challenging to obtain high-quality personality labels. Moreover, most studies treat emotion and personality as independent variables, overlooking their interactions. In this paper, we propose a novel self-supervised framework, EmoPerso, which improves personality detection through emotion-aware modelling. EmoPerso first leverages generative mechanisms for synthetic data augmentation and rich representation learning. It then extracts pseudo-labeled emotion features and jointly optimizes them with personality prediction via multi-task learning. A cross-attention module is employed to capture fine-grained interactions between personality traits and the inferred emotional representations. To further refine relational reasoning, EmoPerso adopts a self-taught strategy to enhance the model's reasoning capabilities iteratively. Extensive experiments on two benchmark datasets demonstrate that EmoPerso surpasses state-of-the-art models. The source code is available at https://github.com/slz0925/EmoPerso.
中文摘要:EmoPerso框架通过情感感知建模,结合合成数据增强、多任务学习和交叉注意力机制,显著提升了从文本中检测人格特征的性能,并在基准数据集上超越了现有最优模型。
English Summary: The EmoPerso framework enhances personality detection by integrating emotion-aware modeling through synthetic data augmentation, multi-task learning, and cross-attention mechanisms, outperforming existing methods on benchmark datasets.

Authors:Jingru Fan, Yufan Dang, Jingyao Wu, Huatao Li, Runde Yang, Xiyuan Yang, Yuheng Wang, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Dahai Li, Chen Qian
Title: AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent
Abstract:
With the raid evolution of large language models and multimodal foundation models, the mobile-agent landscape has proliferated without converging on the fundamental challenges. This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact: (1) generalization across tasks, modalities, apps, and devices; (2) accuracy, specifically precise on-screen interaction and click targeting; (3) long-horizon capability for sustained, multi-step goals; and (4) efficiency, specifically high-performance runtime on resource-constrained devices. We present AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant that operates across applications and constitutes a full-stack, closed-loop system from data to deployment. AppCopilot operationalizes this position through an end-to-end autonomous pipeline spanning data collection, training, deployment, high-quality and efficient inference, and mobile application development. At the model layer, it integrates multimodal foundation models with robust Chinese-English support. At the reasoning and control layer, it combines chain-of-thought reasoning, hierarchical task planning and decomposition, and multi-agent collaboration. At the execution layer, it enables user personalization and experiential adaptation, voice interaction, function calling, cross-app and cross-device orchestration, and comprehensive mobile app support. The system design incorporates profiling-driven optimization for latency, memory, and energy across heterogeneous hardware. Empirically, AppCopilot achieves significant improvements along all four dimensions: stronger generalization, higher-precision on-screen actions, more reliable long-horizon task completion, and faster, more resource-efficient runtime.
中文摘要:本文提出AppCopilot,一种设备端多模态助手,通过融合基础模型、多智能体协作和移动端优化部署,系统性地解决了移动智能体在泛化能力、操作精度、长程任务和运行效率四大核心难题。
English Summary: This paper introduces AppCopilot, an on-device multimodal assistant designed to address four core challenges in mobile agents—generalization, accuracy, long-horizon capability, and efficiency—through an integrated system combining foundation models, multi-agent collaboration, and optimized mobile deployment.

Authors:Yanwen Zou, Zhaoye Zhou, Chenyang Shi, Zewei Ye, Junda Huang, Yan Ding, Bo Zhao
Title: U-ARM : Ultra low-cost general teleoperation interface for robot manipulation
Abstract:
We propose U-Arm, a low-cost and rapidly adaptable leader-follower teleoperation framework designed to interface with most of commercially available robotic arms. Our system supports teleoperation through three structurally distinct 3D-printed leader arms that share consistent control logic, enabling seamless compatibility with diverse commercial robot configurations. Compared with previous open-source leader-follower interfaces, we further optimized both the mechanical design and servo selection, achieving a bill of materials (BOM) cost of only \$50.5 for the 6-DoF leader arm and \$56.8 for the 7-DoF version. To enhance usability, we mitigate the common challenge in controlling redundant degrees of freedom by %engineering methods mechanical and control optimizations. Experimental results demonstrate that U-Arm achieves 39\% higher data collection efficiency and comparable task success rates across multiple manipulation scenarios compared with Joycon, another low-cost teleoperation interface. We have open-sourced all CAD models of three configs and also provided simulation support for validating teleoperation workflows. We also open-sourced real-world manipulation data collected with U-Arm. The project website is https://github.com/MINT-SJTU/LeRobot-Anything-U-Arm.
中文: U-Arm是一种低成本、快速适配的主从遥操作框架,通过3D打印主臂和优化设计显著降低成本,在提升数据采集效率和任务成功率的同时兼容多种商用机器人配置。
English: U-Arm is a low-cost, adaptable leader-follower teleoperation framework compatible with most commercial robotic arms, featuring 3D-printed leader arms and optimized mechanics that reduce costs while improving data collection efficiency and task success rates.

Authors:Tao Wang, Zhenxuan Zhang, Yuanbo Zhou, Xinlin Zhang, Yuanbin Chen, Tao Tan, Guang Yang, Tong Tong
Title: From Noisy Labels to Intrinsic Structure: A Geometric-Structural Dual-Guided Framework for Noise-Robust Medical Image Segmentation
Abstract:
The effectiveness of convolutional neural networks in medical image segmentation relies on large-scale, high-quality annotations, which are costly and time-consuming to obtain. Even expert-labeled datasets inevitably contain noise arising from subjectivity and coarse delineations, which disrupt feature learning and adversely impact model performance. To address these challenges, this study propose a Geometric-Structural Dual-Guided Network (GSD-Net), which integrates geometric and structural cues to improve robustness against noisy annotations. It incorporates a Geometric Distance-Aware module that dynamically adjusts pixel-level weights using geometric features, thereby strengthening supervision in reliable regions while suppressing noise. A Structure-Guided Label Refinement module further refines labels with structural priors, and a Knowledge Transfer module enriches supervision and improves sensitivity to local details. To comprehensively assess its effectiveness, we evaluated GSD-Net on six publicly available datasets: four containing three types of simulated label noise, and two with multi-expert annotations that reflect real-world subjectivity and labeling inconsistencies. Experimental results demonstrate that GSD-Net achieves state-of-the-art performance under noisy annotations, achieving improvements of 2.52% on Kvasir, 22.76% on Shenzhen, 8.87% on BU-SUC, and 4.59% on BraTS2020 under SR simulated noise. The codes of this study are available at https://github.com/ortonwang/GSD-Net.
Chinese: 本研究提出GSD-Net,一种几何与结构双引导网络,通过动态调整监督和优化标签来增强对医学图像噪声标注的鲁棒性,在多个数据集上实现了最先进的性能。
English: This study introduces GSD-Net, a geometric-structural dual-guided network that enhances robustness against noisy medical image annotations by dynamically adjusting supervision and refining labels, achieving state-of-the-art performance across multiple datasets.

Authors:Xiaobao Wei, Changyong Shu, Zhaokun Yue, Chang Huang, Weiwei Liu, Shuai Yang, Lirong Yang, Peng Gao, Wenbin Zhang, Gaochao Zhu, Chengxiang Wang
Title: Decoupling Bidirectional Geometric Representations of 4D cost volume with 2D convolution
Abstract:
High-performance real-time stereo matching methods invariably rely on 3D regularization of the cost volume, which is unfriendly to mobile devices. And 2D regularization based methods struggle in ill-posed regions. In this paper, we present a deployment-friendly 4D cost aggregation network DBStereo, which is based on pure 2D convolutions. Specifically, we first provide a thorough analysis of the decoupling characteristics of 4D cost volume. And design a lightweight bidirectional geometry aggregation block to capture spatial and disparity representation respectively. Through decoupled learning, our approach achieves real-time performance and impressive accuracy simultaneously. Extensive experiments demonstrate that our proposed DBStereo outperforms all existing aggregation-based methods in both inference time and accuracy, even surpassing the iterative-based method IGEV-Stereo. Our study break the empirical design of using 3D convolutions for 4D cost volume and provides a simple yet strong baseline of the proposed decouple aggregation paradigm for further study. Code will be available at (\href{https://github.com/happydummy/DBStereo}{https://github.com/happydummy/DBStereo}) soon.
Chinese: DBStereo提出了一种基于纯二维卷积的易部署4D代价聚合网络,通过解耦空间和视差学习,在实现实时性能的同时获得了卓越精度,超越了现有方法且无需依赖三维正则化。
English: DBStereo introduces a deployment-friendly 4D cost aggregation network using pure 2D convolutions, achieving real-time performance and superior accuracy by decoupling spatial and disparity learning, outperforming existing methods without relying on 3D regularization.

Authors:Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
Title: MedDINOv3: How to adapt vision foundation models for medical image segmentation?
Abstract:
Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperform specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
Chinese: MedDINOv3通过领域自适应预训练和架构改进,将DINOv3视觉基础模型成功应用于医学图像分割,在多个基准测试中达到最优性能,有效解决了自然图像与医学图像间的领域差异及ViT模型性能不足的问题。
English: MedDINOv3 adapts the DINOv3 vision foundation model to medical imaging through domain-specific pretraining and architectural enhancements, achieving state-of-the-art performance across multiple segmentation benchmarks while addressing key challenges in transferability and ViT underperformance.

Authors:Jinseok Kim, Sukmin Cho, Soyeong Jeong, Sangyeop Kim, Sungzoon Cho
Title: Upcycling Candidate Tokens of Large Language Models for Query Expansion
Abstract:
Query Expansion (QE) improves retrieval performance by enriching queries with related terms. Recently, Large Language Models (LLMs) have been used for QE, but existing methods face a trade-off: generating diverse terms boosts performance but increases computational cost. To address this challenge, we propose Candidate Token Query Expansion (CTQE), which extracts diverse and relevant terms from a single LLM decoding pass by leveraging unselected candidate tokens. These tokens, though not part of the final output, are conditioned on the full query and capture useful information. By aggregating them, CTQE achieves both relevance and diversity without extra inference, reducing overhead and latency. Experiments show that CTQE delivers strong retrieval performance with significantly lower cost, outperforming or comparable to more expensive methods. Code is available at: https://github.com/bluejeans8/CTQE
中文摘要:候选词查询扩展(CTQE)方法通过利用未选中的候选词标记,在单次大语言模型解码中提取多样化扩展词,以更低成本实现优异的检索性能。
English Summary: The proposed Candidate Token Query Expansion (CTQE) method efficiently extracts diverse expansion terms from a single LLM decoding pass by utilizing unselected candidate tokens, achieving strong retrieval performance with reduced computational cost.

Authors:Jindong Li, Yali Fu, Li Fan, Jiahong Liu, Yao Shu, Chengwei Qin, Menglin Yang, Irwin King, Rex Ying
Title: Implicit Reasoning in Large Language Models: A Comprehensive Survey
Abstract:
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Reasoning with LLMs is central to solving multi-step problems and complex decision-making. To support efficient reasoning, recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning, where reasoning occurs silently via latent structures without emitting intermediate textual steps. Implicit reasoning brings advantages such as lower generation cost, faster inference, and better alignment with internal computation. Although prior surveys have discussed latent representations in the context of reasoning, a dedicated and mechanism-level examination of how reasoning unfolds internally within LLMs remains absent. This survey fills that gap by introducing a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies. We organize existing methods into three execution paradigms based on \textbf{\textit{how and where internal computation unfolds}}: latent optimization, signal-guided control, and layer-recurrent execution. We also review structural, behavioral and representation-based evidence that supports the presence of implicit reasoning in LLMs. We further provide a structured overview of the evaluation metrics and benchmarks used in existing works to assess the effectiveness and reliability of implicit reasoning. We maintain a continuously updated project at: https://github.com/digailab/awesome-llm-implicit-reasoning.
中文: 本综述提出以执行范式为核心的分类法,探讨大型语言模型内部如何进行隐性推理,将方法归纳为潜在优化、信号引导控制和层循环执行,并评述了支持证据与评估体系。
English: This survey introduces a taxonomy focused on execution paradigms to examine how implicit reasoning occurs internally within LLMs, organizing methods into latent optimization, signal-guided control, and layer-recurrent execution while reviewing supporting evidence and evaluation metrics.

Authors:Lan Wei, Lou Genoud, Dandan Zhang
Title: Physics-Informed Machine Learning with Adaptive Grids for Optical Microrobot Depth Estimation
Abstract:
Optical microrobots actuated by optical tweezers (OT) offer great potential for biomedical applications such as cell manipulation and microscale assembly. These tasks demand accurate three-dimensional perception to ensure precise control in complex and dynamic biological environments. However, the transparent nature of microrobots and low-contrast microscopic imaging challenge conventional deep learning methods, which also require large annotated datasets that are costly to obtain. To address these challenges, we propose a physics-informed, data-efficient framework for depth estimation of optical microrobots. Our method augments convolutional feature extraction with physics-based focus metrics, such as entropy, Laplacian of Gaussian, and gradient sharpness, calculated using an adaptive grid strategy. This approach allocates finer grids over microrobot regions and coarser grids over background areas, enhancing depth sensitivity while reducing computational complexity. We evaluate our framework on multiple microrobot types and demonstrate significant improvements over baseline models. Specifically, our approach reduces mean squared error (MSE) by over 60% and improves the coefficient of determination (R^2) across all test cases. Notably, even when trained on only 20% of the available data, our model outperforms ResNet50 trained on the full dataset, highlighting its robustness under limited data conditions. Our code is available at: https://github.com/LannWei/CBS2025.
Chinese Summary: 本研究提出了一种融合物理聚焦指标与自适应网格策略的深度估计框架,显著提升了光学微机器人的三维感知精度,在数据有限条件下误差降低超60%,性能优于传统方法。
English Summary: This study introduces a physics-informed framework that enhances depth estimation for optical microrobots by integrating focus metrics with adaptive grid strategies, achieving over 60% error reduction and superior performance with minimal training data.

Authors:Yihong Wu, Jinqiao Wei, Xionghui Zhao, Yidi Li, Shaoyi Du, Bin Ren, Nicu Sebe
Title: DSGC-Net: A Dual-Stream Graph Convolutional Network for Crowd Counting via Feature Correlation Mining
Abstract:
Deep learning-based crowd counting methods have achieved remarkable progress in recent years. However, in complex crowd scenarios, existing models still face challenges when adapting to significant density distribution differences between regions. Additionally, the inconsistency of individual representations caused by viewpoint changes and body posture differences further limits the counting accuracy of the models. To address these challenges, we propose DSGC-Net, a Dual-Stream Graph Convolutional Network based on feature correlation mining. DSGC-Net introduces a Density Approximation (DA) branch and a Representation Approximation (RA) branch. By modeling two semantic graphs, it captures the potential feature correlations in density variations and representation distributions. The DA branch incorporates a density prediction module that generates the density distribution map, and constructs a density-driven semantic graph based on density similarity. The RA branch establishes a representation-driven semantic graph by computing global representation similarity. Then, graph convolutional networks are applied to the two semantic graphs separately to model the latent semantic relationships, which enhance the model's ability to adapt to density variations and improve counting accuracy in multi-view and multi-pose scenarios. Extensive experiments on three widely used datasets demonstrate that DSGC-Net outperforms current state-of-the-art methods. In particular, we achieve MAE of 48.9 and 5.9 in ShanghaiTech Part A and Part B datasets, respectively. The released code is available at: https://github.com/Wu-eon/CrowdCounting-DSGCNet.
中文: DSGC-Net通过双流图卷积网络,构建密度和表征语义图来捕捉特征关联,有效提升了复杂人群场景下的计数精度,在多个数据集上实现了最优性能。
English: DSGC-Net, a dual-stream graph convolutional network, addresses crowd counting challenges by modeling density and representation correlations through two semantic graphs, achieving state-of-the-art performance on benchmark datasets.

Authors:Nils Hoehing, Mayug Maniparambil, Ellen Rushe, Noel E. O'Connor, Anthony Ventresque
Title: Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks
Abstract:
We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It is comprised of entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: https://github.com/nilshoehing/rocketscience
Chinese: RocketScience 是一个评估视觉语言模型空间关系理解能力的开源基准测试,发现现有模型存在显著缺陷,并证实空间推理能力是主要瓶颈,而非物体定位能力。
English: RocketScience is an open-source benchmark that evaluates spatial relation understanding in vision-language models, revealing significant deficiencies in current models despite high human performance and identifying spatial reasoning as the primary bottleneck.

Authors:Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj
Title: SALAD -- Semantics-Aware Logical Anomaly Detection
Abstract:
Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1%. Code: https://github.com/MaticFuc/SALAD
Chinese Summary: 提出的SALAD方法通过新的组合分支显式建模物体组合图分布,无需人工标注即可在MVTec LOCO基准上实现96.1%的图像级AUROC,显著提升了逻辑异常检测性能。
English Summary: The proposed SALAD method enhances logical anomaly detection by explicitly modeling object composition maps with a new composition branch, achieving a 96.1% AUROC on MVTec LOCO without requiring manual labels.

Authors:Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Jian Guo, Yuanzhuo Wang
Title: JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer
Abstract:
Current evaluation paradigms for large language models (LLMs) suffer from overestimated or biased evaluations and mismatched question difficulty, leading to incomplete evaluations of knowledge and capability boundaries, which hinder their effective application and optimization. To address these challenges, we propose Agent-as-Interviewer, a dynamic evaluation paradigm that employs LLM agents to conduct multi-turn interactions for evaluation. Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to invoke knowledge tools for wider and deeper knowledge in the dynamic multi-turn question generation, achieving more comprehensive evaluations of LLM's knowledge boundaries. It also leverages agents to plan query strategies for adjustment of the question difficulty levels, enhancing the difficulty control to match the actual capabilities of target LLMs. Based on this paradigm, we develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool and uses difficulty scoring as strategy guidance, thereby finally providing valuable suggestions to help targets optimize themselves. Extensive experiments validate the effectiveness of JudgeAgent's suggestions, demonstrating that Agent-as-Interviewer can accurately identify the knowledge and capability boundaries of target models. The source code is available on https://github.com/DataArcTech/JudgeAgent.
中文:Agent-as-Interviewer范式通过AI代理进行动态多轮交互和问题难度调节,解决了当前大语言模型评估的局限性,能更准确地识别知识边界并提供优化建议。
English: The Agent-as-Interviewer paradigm addresses limitations in current LLM evaluations by using AI agents to conduct dynamic multi-turn interactions and adjust question difficulty, enabling more accurate identification of knowledge boundaries and providing optimization suggestions.

Authors:Jian Chen, Jiabao Dou, Jinbao Tian, Yunqi Yang, Zhou Li
Title: Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports
Abstract:
The automatic classification of occupational accident reports is a critical research area for enhancing workplace safety and enabling large-scale risk analysis. However, the severe class imbalance inherent in these real-world datasets often compromises the performance of analytical models, particularly for rare but severe incident types, hindering the development of reliable automated systems. To address this challenge, we propose ABEX-RAT, a novel and efficient framework that synergizes generative data augmentation with robust adversarial training. Our approach first employs a twostep abstractive-expansive (ABEX) pipeline, which leverages a large language model to distill core incident semantics and then uses a generative model to create diverse, highquality synthetic samples for underrepresented classes. Subsequently, a lightweight classifier is trained on the augmented data using a computationally efficient random adversarial training (RAT) protocol, which stochastically applies perturbations to enhance model generalization and robustness without significant overhead. Experimental results on the public OSHA dataset demonstrate that our method achieves new state-of-the-art performance, reaching a macro-F1 score of 90.32% and significantly outperforming previous SOTA and fine-tuned large model baselines. Our work validates that this synergistic strategy is a highly effective and efficient alternative to brute-force fine-tuning for specialized, imbalanced classification tasks. The code is publicly available at:https://github.com/nxcc-lab/ABEX-RAT.
中文: 提出的ABEX-RAT框架结合生成式数据增强和对抗训练,有效解决了职业事故报告分类中的类别不平衡问题,在OSHA数据集上实现了最先进的性能表现。
English: The proposed ABEX-RAT framework combines generative data augmentation and adversarial training to effectively address class imbalance in occupational accident report classification, achieving state-of-the-art performance on the OSHA dataset.

Authors:Yuhao Wang, Junwei Pan, Xinhang Li, Maolin Wang, Yuan Wang, Yue Liu, Dapeng Liu, Jie Jiang, Xiangyu Zhao
Title: Empowering Large Language Model for Sequential Recommendation via Multimodal Embeddings and Semantic IDs
Abstract:
Sequential recommendation (SR) aims to capture users' dynamic interests and sequential patterns based on their historical interactions. Recently, the powerful capabilities of large language models (LLMs) have driven their adoption in SR. However, we identify two critical challenges in existing LLM-based SR methods: 1) embedding collapse when incorporating pre-trained collaborative embeddings and 2) catastrophic forgetting of quantized embeddings when utilizing semantic IDs. These issues dampen the model scalability and lead to suboptimal recommendation performance. Therefore, based on LLMs like Llama3-8B-instruct, we introduce a novel SR framework named MME-SID, which integrates multimodal embeddings and quantized embeddings to mitigate embedding collapse. Additionally, we propose a Multimodal Residual Quantized Variational Autoencoder (MM-RQ-VAE) with maximum mean discrepancy as the reconstruction loss and contrastive learning for alignment, which effectively preserve intra-modal distance information and capture inter-modal correlations, respectively. To further alleviate catastrophic forgetting, we initialize the model with the trained multimodal code embeddings. Finally, we fine-tune the LLM efficiently using LoRA in a multimodal frequency-aware fusion manner. Extensive experiments on three public datasets validate the superior performance of MME-SID thanks to its capability to mitigate embedding collapse and catastrophic forgetting. The implementation code and datasets are publicly available for reproduction: https://github.com/Applied-Machine-Learning-Lab/MME-SID.
中文:MME-SID框架通过整合多模态嵌入和新型量化变分自编码器,解决了序列推荐中的嵌入塌缩和灾难性遗忘问题,并在公开数据集上验证了其优越性能。
English: The MME-SID framework addresses embedding collapse and catastrophic forgetting in sequential recommendation by integrating multimodal embeddings and a novel quantized variational autoencoder, validated through experiments on public datasets.

Authors:Ziyun Zeng, Junhao Zhang, Wei Li, Mike Zheng Shou
Title: Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination
Abstract:
In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models will be available at https://github.com/showlab/DIM.
中文: 该研究提出了Draw-In-Mind (DIM)数据集和模型,通过强化理解模块的设计职责来解决多模态模型中职责分配失衡问题,以较少参数量实现了图像编辑任务的顶尖性能。
English: The study introduces Draw-In-Mind (DIM), a dataset and model that addresses imbalanced responsibilities in unified multimodal models by enhancing the understanding module's design role, achieving state-of-the-art image editing performance with fewer parameters.

Authors:Ziyun Zeng, Junhao Zhang, Wei Li, Mike Zheng Shou
Title: Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing
Abstract:
In recent years, integrating multimodal understanding and generation into a single unified model has emerged as a promising paradigm. While this approach achieves strong results in text-to-image (T2I) generation, it still struggles with precise image editing. We attribute this limitation to an imbalanced division of responsibilities. The understanding module primarily functions as a translator that encodes user instructions into semantic conditions, while the generation module must simultaneously act as designer and painter, inferring the original layout, identifying the target editing region, and rendering the new content. This imbalance is counterintuitive because the understanding module is typically trained with several times more data on complex reasoning tasks than the generation module. To address this issue, we introduce Draw-In-Mind (DIM), a dataset comprising two complementary subsets: (i) DIM-T2I, containing 14M long-context image-text pairs to enhance complex instruction comprehension; and (ii) DIM-Edit, consisting of 233K chain-of-thought imaginations generated by GPT-4o, serving as explicit design blueprints for image edits. We connect a frozen Qwen2.5-VL-3B with a trainable SANA1.5-1.6B via a lightweight two-layer MLP, and train it on the proposed DIM dataset, resulting in DIM-4.6B-T2I/Edit. Despite its modest parameter scale, DIM-4.6B-Edit achieves SOTA or competitive performance on the ImgEdit and GEdit-Bench benchmarks, outperforming much larger models such as UniWorld-V1 and Step1X-Edit. These findings demonstrate that explicitly assigning the design responsibility to the understanding module provides significant benefits for image editing. Our dataset and models are available at https://github.com/showlab/DIM.
中文: 该研究提出了Draw-In-Mind (DIM)数据集和模型,通过强化理解模块的设计职责来解决多模态模型中职责分配失衡问题,以较少参数量实现了图像编辑任务的顶尖性能。
English: The study introduces Draw-In-Mind (DIM), a dataset and model that addresses imbalanced responsibilities in unified multimodal models by enhancing the understanding module's design role, achieving state-of-the-art image editing performance with fewer parameters.

Authors:Yilin Guan, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang, Wenyue Hua
Title: Dynamic Speculative Agent Planning
Abstract:
Despite their remarkable success in complex tasks propelling widespread adoption, large language-model-based agents still face critical deployment challenges due to prohibitive latency and inference costs. While recent work has explored various methods to accelerate inference, existing approaches suffer from significant limitations: they either fail to preserve performance fidelity, require extensive offline training of router modules, or incur excessive operational costs. Moreover, they provide minimal user control over the tradeoff between acceleration and other performance metrics. To address these gaps, we introduce Dynamic Speculative Planning (DSP), an asynchronous online reinforcement learning framework that provides lossless acceleration with substantially reduced costs without requiring additional pre-deployment preparation. DSP explicitly optimizes a joint objective balancing end-to-end latency against dollar cost, allowing practitioners to adjust a single parameter that steers the system toward faster responses, cheaper operation, or any point along this continuum. Experiments on two standard agent benchmarks demonstrate that DSP achieves comparable efficiency to the fastest lossless acceleration method while reducing total cost by 30% and unnecessary cost up to 60%. Our code and data are available through https://github.com/guanyilin428/Dynamic-Speculative-Planning.
Large language model agents face high latency and cost issues, which Dynamic Speculative Planning (DSP) addresses through an online reinforcement learning framework that enables lossless acceleration with 30% cost reduction while allowing adjustable performance trade-offs.
English Summary:

Authors:Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Wenchao Yang, Yitong Yang, Xingyao Zhang, Yingshui Tan, Jialing Tao, Hui Xue
Title: Oyster-I: Beyond Refusal - Constructive Safety Alignment for Responsible Language Models
Abstract:
Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, often framing risks as adversarial events and relying on defensive refusals. However, in real-world settings, risks also come from non-malicious users seeking help while under psychological distress (e.g., self-harm intentions). In such cases, the model's response can strongly influence the user's next actions. Simple refusals may lead them to repeat, escalate, or move to unsafe platforms, creating worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining high general capabilities. On our Constructive Benchmark, it shows strong constructive engagement, close to GPT-5, and unmatched robustness on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe, but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.
中文: 现有大语言模型的安全机制常因防御性拒绝而无法帮助心理脆弱的用户,因此CSA提出以人为中心的安全对齐方法,通过预期推理和信任建立引导高危用户获得安全结果,在开源模型中实现了顶尖的安全性和通用能力。
English: Current LLM safety mechanisms often fail vulnerable users by using defensive refusals, so CSA introduces a human-centric approach that guides at-risk users toward safe outcomes through anticipatory reasoning and trust-building, achieving top safety and capability levels in open models.

Authors:Zhenyuan Chen, Chenxi Wang, Ningyu Zhang, Feng Zhang
Title: RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
Abstract:
Remote sensing is critical for disaster monitoring, yet existing datasets lack temporal image pairs and detailed textual annotations. While single-snapshot imagery dominates current resources, it fails to capture dynamic disaster impacts over time. To address this gap, we introduce the Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods, wildfires, and more) paired with rich, human-like change captions. By bridging the temporal and semantic divide in remote sensing data, RSCC enables robust training and evaluation of vision-language models for disaster-aware bi-temporal understanding. Our results highlight RSCC's ability to facilitate detailed disaster-related analysis, paving the way for more accurate, interpretable, and scalable vision-language applications in remote sensing. Code and dataset are available at https://github.com/Bili-Sakura/RSCC.
中文摘要:RSCC数据集通过提供62,315组灾前/灾后图像对及拟人化变化描述,弥补了遥感数据中时序图像对与详细标注的缺失,为灾害感知的双时相视觉语言模型提供了可靠的训练与评估基础。
English Summary: The RSCC dataset addresses the lack of temporal image pairs and detailed annotations in remote sensing by providing 62,315 pre-/post-disaster image pairs with human-like change captions, enabling robust training of vision-language models for disaster analysis.

Authors:Zhipeng Weng, Xiaopeng Liu, Ce Liu, Xingyuan Guo, Yukai Shi, Liang Lin
Title: DroneSR: Rethinking Few-shot Thermal Image Super-Resolution from Drone-based Perspective
Abstract:
Although large scale models achieve significant improvements in performance, the overfitting challenge still frequently undermines their generalization ability. In super resolution tasks on images, diffusion models as representatives of generative models typically adopt large scale architectures. However, few-shot drone-captured infrared training data frequently induces severe overfitting in large-scale architectures. To address this key challenge, our method proposes a new Gaussian quantization representation learning method oriented to diffusion models that alleviates overfitting and enhances robustness. At the same time, an effective monitoring mechanism tracks large scale architectures during training to detect signs of overfitting. By introducing Gaussian quantization representation learning, our method effectively reduces overfitting while maintaining architecture complexity. On this basis, we construct a multi source drone-based infrared image benchmark dataset for detection and use it to emphasize overfitting issues of large scale architectures in few sample, drone-based diverse drone-based image reconstruction scenarios. To verify the efficacy of the method in mitigating overfitting, experiments are conducted on the constructed benchmark. Experimental results demonstrate that our method outperforms existing super resolution approaches and significantly mitigates overfitting of large scale architectures under complex conditions. The code and DroneSR dataset will be available at: https://github.com/wengzp1/GARLSR.
中文:本文针对无人机拍摄的少量红外图像导致大规模架构过拟合的问题,提出了一种面向扩散模型的高斯量化表示学习方法,通过构建的新基准数据集验证了该方法在抑制过拟合方面优于现有超分辨率技术。
English: This paper introduces a Gaussian quantization representation learning method for diffusion models to mitigate overfitting in large-scale architectures when processing few-shot drone-captured infrared images, validated through a newly constructed benchmark dataset showing superior performance over existing super-resolution approaches.

Authors:Wen Ye, Jinbo Liu, Defu Cao, Wei Yang, Yan Liu
Title: When LLM Meets Time Series: Can LLMs Perform Multi-Step Time Series Reasoning and Inference
Abstract:
The rapid advancement of Large Language Models (LLMs) has sparked growing interest in their application to time series analysis tasks. However, their ability to perform complex reasoning over temporal data in real-world application domains remains underexplored. To move toward this goal, a first step is to establish a rigorous benchmark dataset for evaluation. In this work, we introduce the TSAIA Benchmark, a first attempt to evaluate LLMs as time-series AI assistants. To ensure both scientific rigor and practical relevance, we surveyed over 20 academic publications and identified 33 real-world task formulations. The benchmark encompasses a broad spectrum of challenges, ranging from constraint-aware forecasting to anomaly detection with threshold calibration: tasks that require compositional reasoning and multi-step time series analysis. The question generator is designed to be dynamic and extensible, supporting continuous expansion as new datasets or task types are introduced. Given the heterogeneous nature of the tasks, we adopt task-specific success criteria and tailored inference-quality metrics to ensure meaningful evaluation for each task. We apply this benchmark to assess eight state-of-the-art LLMs under a unified evaluation protocol. Our analysis reveals limitations in current models' ability to assemble complex time series analysis workflows, underscoring the need for specialized methodologies for domain-specific adaptation. Our benchmark is available at https://huggingface.co/datasets/Melady/TSAIA, and the code is available at https://github.com/USC-Melady/TSAIA.
中文: 本研究提出了TSAIA基准来评估大语言模型作为时间序列AI助手的能力,发现尽管涵盖多种现实任务,现有模型在处理复杂时序推理方面仍存在明显局限。
English: This study introduces the TSAIA Benchmark to evaluate Large Language Models as time-series AI assistants, revealing their limitations in handling complex temporal reasoning despite covering diverse real-world tasks.

Authors:Mingxuan Cui, Yilan Jiang, Duo Zhou, Cheng Qian, Yuji Zhang, Qiong Wang
Title: ShortageSim: Simulating Drug Shortages under Information Asymmetry
Abstract:
Drug shortages pose critical risks to patient care and healthcare systems worldwide, yet the effectiveness of regulatory interventions remains poorly understood due to fundamental information asymmetries in pharmaceutical supply chains. We present \textbf{ShortageSim}, the first Large Language Model (LLM)-based multi-agent simulation framework that captures the complex, strategic interactions between drug manufacturers, institutional buyers, and regulatory agencies in response to shortage alerts. Unlike traditional game-theoretic models that assume perfect rationality and complete information, \textbf{ShortageSim} leverages LLMs to simulate bounded-rational decision-making under uncertainty. Through a sequential production game spanning multiple quarters, we model how FDA announcements, both reactive alerts about existing shortages and proactive warnings about potential disruptions, propagate through the supply chain and influence capacity investment and procurement decisions. Our experiments on historical shortage events reveal that \textbf{ShortageSim} reduces the resolution-lag percentage for discontinued-disclosed cases by 83\%, bringing simulated durations more aligned to ground truth than the zero-shot baseline. We open-source \textbf{ShortageSim} and a dataset of 2,925 FDA shortage events at https://github.com/Lemutisme/Sortage_Management, providing a novel computational framework for designing and testing interventions in complex, information-scarce supply chains.
中文: ShortageSim作为首个基于大语言模型的多智能体仿真框架,通过模拟药品供应链在不确定条件下的复杂交互,将短缺解决延迟降低了83%,并整合了FDA真实短缺数据用于干预措施测试。
English: ShortageSim is a novel LLM-based multi-agent simulation framework that models pharmaceutical supply chain interactions under uncertainty, significantly improving shortage resolution accuracy by 83% compared to baselines while incorporating real FDA shortage data.

Authors:Aryan Amit Barsainyan, Jing Yu Lim, Dianbo Liu
Title: STORI: A Benchmark and Taxonomy for Stochastic Environments
Abstract:
Reinforcement learning (RL) techniques have achieved impressive performance on simulated benchmarks such as Atari100k, yet recent advances remain largely confined to simulation and show limited transfer to real-world domains. A central obstacle is environmental stochasticity, as real systems involve noisy observations, unpredictable dynamics, and non-stationary conditions that undermine the stability of current methods. Existing benchmarks rarely capture these uncertainties and favor simplified settings where algorithms can be tuned to succeed. The absence of a well-defined taxonomy of stochasticity further complicates evaluation, as robustness to one type of stochastic perturbation, such as sticky actions, does not guarantee robustness to other forms of uncertainty. To address this critical gap, we introduce STORI (STOchastic-ataRI), a benchmark that systematically incorporates diverse stochastic effects and enables rigorous evaluation of RL techniques under different forms of uncertainty. We propose a comprehensive five-type taxonomy of environmental stochasticity and demonstrate systematic vulnerabilities in state-of-the-art model-based RL algorithms through targeted evaluation of DreamerV3 and STORM. Our findings reveal that world models dramatically underestimate environmental variance, struggle with action corruption, and exhibit unreliable dynamics under partial observability. We release the code and benchmark publicly at https://github.com/ARY2260/stori, providing a unified framework for developing more robust RL systems.
中文摘要:STORI基准通过引入五类随机性分类法,系统评估强化学习在真实环境不确定性下的表现,揭示了DreamerV3和STORM等先进算法在环境方差估计和动态建模方面的系统性缺陷。
English Summary: The STORI benchmark addresses the gap in evaluating reinforcement learning under real-world stochasticity by introducing a five-type taxonomy and revealing vulnerabilities in state-of-the-art algorithms like DreamerV3 and STORM.

Authors:Austin Meek, Carlos H. Mendoza-Cardenas, Austin J. Brockmeier
Title: Convolutional Monge Mapping between EEG Datasets to Support Independent Component Labeling
Abstract:
EEG recordings contain rich information about neural activity but are subject to artifacts, noise, and superficial differences due to sensors, amplifiers, and filtering. Independent component analysis and automatic labeling of independent components (ICs) enable artifact removal in EEG pipelines. Convolutional Monge Mapping Normalization (CMMN) is a recent tool used to achieve spectral conformity of EEG signals, which was shown to improve deep neural network approaches for sleep staging. Here we propose a novel extension of the CMMN method with two alternative approaches to computing the source reference spectrum the target signals are mapped to: (1) channel-averaged and $l_1$-normalized barycenter, and (2) a subject-to-subject mapping that finds the source subject with the closest spectrum to the target subject. Notably, our extension yields space-time separable filters that can be used to map between datasets with different numbers of EEG channels. We apply these filters in an IC classification task, and show significant improvement in recognizing brain versus non-brain ICs. Clinical relevance - EEG recordings are used in the diagnosis and monitoring of multiple neuropathologies, including epilepsy and psychosis. While EEG analysis can benefit from automating artifact removal through independent component analysis and labeling, differences in recording equipment and context (the presence of noise from electrical wiring and other devices) may impact the performance of machine learning models, but these differences can be minimized by appropriate spectral normalization through filtering.
Chinese: 该摘要提出了一种卷积蒙日映射归一化方法的扩展,通过两种计算源参考频谱的方法改进了脑电图信号归一化,产生了可分离的滤波器,从而提高了大脑与非大脑独立成分的分类效果。
English: This abstract introduces an extension to the Convolutional Monge Mapping Normalization method that improves EEG signal normalization by using two approaches for computing the source reference spectrum, resulting in separable filters that enhance independent component classification between brain and non-brain signals.

Authors:Jiahao Qiu, Jingzhe Shi, Xinzhe Juan, Zelin Zhao, Jiayi Geng, Shilong Liu, Hongru Wang, Sanfeng Wu, Mengdi Wang
Title: Physics Supernova: AI Agent Matches Elite Gold Medalists at IPhO 2025
Abstract:
Physics provides fundamental laws that describe and predict the natural world. AI systems aspiring toward more general, real-world intelligence must therefore demonstrate strong physics problem-solving abilities: to formulate and apply physical laws for explaining and predicting physical processes. The International Physics Olympiad (IPhO)--the world's most prestigious physics competition--offers a rigorous benchmark for this purpose. We introduce Physics Supernova, an AI agent system with superior physics problem-solving abilities that match elite IPhO gold medalists. In IPhO 2025 theory problems, Physics Supernova attains 23.5/30 points, ranking 14th of 406 contestants and surpassing the median performance of human gold medalists. We extensively analyzed Physics Supernova's capabilities and flexibility across diverse physics tasks. These results show that principled tool integration within agent systems can deliver competitive improvements in solving challenging science problems. The codes are available at https://github.com/CharlesQ9/Physics-Supernova.
Chinese: Physics Supernova 是一款具备顶尖物理问题解决能力的人工智能系统,在2025年国际物理奥林匹克竞赛理论题中获得23.5/30分,在406名参赛者中排名第14位,其表现媲美人类金牌得主。
English: Physics Supernova is an AI system that demonstrates elite physics problem-solving abilities, matching top International Physics Olympiad gold medalists by scoring 23.5/30 points and ranking 14th among 406 contestants in the 2025 theory problems.

Authors:Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, Ranjay Krishna
Title: Reinforced Visual Perception with Tools
Abstract:
Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, Our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool-usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.
Chinese Summary: 本文提出ReVPT方法,通过强化学习训练多模态大语言模型使用四种视觉工具来增强视觉推理能力,在多个感知密集型基准测试中取得最优性能,显著超越监督学习和基于文本的强化学习基线。
English Summary: The paper introduces ReVPT, a reinforcement learning approach that enhances multimodal LLMs' visual reasoning by training them to use four visual tools, achieving state-of-the-art results on perception-heavy benchmarks and outperforming supervised baselines by significant margins.

Authors:Dominic Plein
Title: Parallel Needleman-Wunsch on CUDA to measure word similarity based on phonetic transcriptions
Abstract:
We present a method to calculate the similarity between words based on their phonetic transcription (their pronunciation) using the Needleman-Wunsch algorithm. We implement this algorithm in Rust and parallelize it on both CPU and GPU to handle large datasets efficiently. The GPU implementation leverages CUDA and the cudarc Rust library to achieve significant performance improvements. We validate our approach by constructing a fully-connected graph where nodes represent words and edges have weights according to the similarity between the words. This graph is then analyzed using clustering algorithms to identify groups of phonetically similar words. Our results demonstrate the feasibility and effectiveness of the proposed method in analyzing the phonetic structure of languages. It might be easily expanded to other languages.
中文: 本研究提出一种基于语音转录和Needleman-Wunsch算法的词汇相似度计算方法,通过Rust语言实现CPU和GPU并行处理,有效分析大规模数据并识别语音相似的词汇群组。
English: This study introduces a method for calculating word similarity using phonetic transcriptions with the Needleman-Wunsch algorithm, implemented efficiently in Rust with parallel CPU and GPU processing to analyze large datasets and identify phonetically similar word clusters.

Authors:Ganlin Zhang, Shenhan Qian, Xi Wang, Daniel Cremers
Title: ViSTA-SLAM: Visual SLAM with Symmetric Two-view Association
Abstract:
We present ViSTA-SLAM as a real-time monocular visual SLAM system that operates without requiring camera intrinsics, making it broadly applicable across diverse camera setups. At its core, the system employs a lightweight symmetric two-view association (STA) model as the frontend, which simultaneously estimates relative camera poses and regresses local pointmaps from only two RGB images. This design reduces model complexity significantly, the size of our frontend is only 35\% that of comparable state-of-the-art methods, while enhancing the quality of two-view constraints used in the pipeline. In the backend, we construct a specially designed Sim(3) pose graph that incorporates loop closures to address accumulated drift. Extensive experiments demonstrate that our approach achieves superior performance in both camera tracking and dense 3D reconstruction quality compared to current methods. Github repository: https://github.com/zhangganlin/vista-slam
中文:ViSTA-SLAM是一种无需相机内参的实时单目视觉SLAM系统,采用轻量级对称双视图关联模型进行位姿估计和点云重建,显著降低复杂度,在相机跟踪和稠密三维重建方面性能优越。
English: ViSTA-SLAM is a real-time monocular visual SLAM system that operates without camera intrinsics, using a lightweight symmetric two-view association model for efficient pose estimation and point mapping, achieving superior tracking and 3D reconstruction with reduced complexity.

Authors:Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fan, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
Title: Kwai Keye-VL 1.5 Technical Report
Abstract:
In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
中文:Keye-VL-1.5通过慢快双路视频编码、渐进式上下文扩展和强化后训练三大创新,在显著提升视频理解能力的同时保持了优异的多模态性能。
English: Keye-VL-1.5 introduces a Slow-Fast video encoding strategy, progressive context extension, and enhanced post-training to significantly improve video understanding while maintaining strong multimodal performance.

Authors:Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen
Title: Variation-aware Vision Token Dropping for Faster Large Vision-Language Models
Abstract:
Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (\textit{i.e.}, \textbf{V$^2$Drop}), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V$^2$Drop is able to maintain \textbf{94.0\%} and \textbf{98.6\%} of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by \textbf{31.5\%} and \textbf{74.2\%}. When combined with efficient operators, V$^2$Drop further reduces GPU peak memory usage.
中文: 本文提出V²Drop方法,通过渐进式剔除变化最小的视觉标记来实现大视觉语言模型的高效计算,在保持图像和视频理解任务性能的同时显著降低延迟和内存消耗。
English: This paper introduces V²Drop, a token compression method that enhances computational efficiency in large vision-language models by progressively dropping visual tokens with minimal variation, maintaining high performance while significantly reducing latency and memory usage.

Authors:Zihao Wang, Enneng Yang, Lu Yin, Shiwei Liu, Li Shen
Title: Model Unmerging: Making Your Models Unmergeable for Secure Model Sharing
Abstract:
Model merging leverages multiple finetuned expert models to construct a multi-task model with low cost, and is gaining increasing attention. However, as a growing number of finetuned models become publicly available, concerns about the safety of model merging have emerged. Unauthorized merging may infringe on developers' rights and risk leaking sensitive personal information. Most existing methods focus on detecting whether a merged model originates from a specific source model, but fail to effectively prevent illegal merging. In this paper, we propose MergeLock, an active protection mechanism that disrupts model parameters to render them unmergeable, thereby directly preventing unauthorized model merging. Specifically, leveraging the inherent symmetry of the attention mechanism in Transformer-based models, we randomly sample two pairs of invertible matrices and apply them to the Query-Key (QK) and Value-Output (VO) branches. This transformation keeps the model's output unchanged while pushing it away from the shared parameter space of other finetuned models. Extensive experiments across both vision and language tasks demonstrate that MergeLock can degrade the performance of merged models by over 95% when a protected model is involved in most cases, demonstrating its effectiveness. Moreover, we further demonstrate that merged models protected by MergeLock cannot be effectively recovered using low-cost restoration methods, further enhancing robustness against unauthorized merging. The code is available at https://github.com/hetailang/Merge-Lock.
中文: 模型合并利用专家模型构建多任务模型,但存在安全风险,本文提出的MergeLock通过干扰参数有效阻止非法合并,在保持原模型性能的同时显著降低合并模型的效果。
English: Model merging combines expert models for multi-task efficiency but raises safety concerns, leading to the development of MergeLock, which disrupts parameters to prevent unauthorized merging while maintaining original model performance.

Authors:Konstantin Mark, Leonard Galustian, Maximilian P. -P. Kovar, Esther Heid
Title: Feynman-Kac-Flow: Inference Steering of Conditional Flow Matching to an Energy-Tilted Posterior
Abstract:
Conditional Flow Matching(CFM) represents a fast and high-quality approach to generative modelling, but in many applications it is of interest to steer the generated samples towards precise requirements. While steering approaches like gradient-based guidance, sequential Monte Carlo steering or Feynman-Kac steering are well established for diffusion models, they have not been extended to flow matching approaches yet. In this work, we formulate this requirement as tilting the output with an energy potential. We derive, for the first time, Feynman-Kac steering for CFM. We evaluate our approach on a set of synthetic tasks, including the generation of tilted distributions in a high-dimensional space, which is a particularly challenging case for steering approaches. We then demonstrate the impact of Feynman-Kac steered CFM on the previously unsolved challenge of generated transition states of chemical reactions with the correct chirality, where the reactants or products can have a different handedness, leading to geometric constraints of the viable reaction pathways connecting reactants and products. Code to reproduce this study is avaiable open-source at https://github.com/heid-lab/fkflow.
中文: 条件流匹配是一种快速生成建模方法,本研究首次将Feynman-Kac引导技术应用于该框架,实现了对生成样本的精确控制,并成功解决了生成具有正确手性化学反应过渡态等具有几何约束的挑战性任务。
English: Conditional Flow Matching (CFM) is a fast generative modeling method, and this work introduces Feynman-Kac steering to CFM for the first time, enabling precise control over generated samples and successfully applying it to challenging tasks like generating chemically accurate transition states with correct chirality.

Authors:Kairong Han, Wenshuo Zhao, Ziyu Zhao, JunJian Ye, Lujia Pan, Kun Kuang
Title: CAT: Causal Attention Tuning For Injecting Fine-grained Causal Knowledge into Large Language Models
Abstract:
Large Language Models (LLMs) have achieved remarkable success across various domains. However, a fundamental question remains: Can LLMs effectively utilize causal knowledge for prediction and generation? Through empirical studies, we find that LLMs trained directly on large-scale data often capture spurious correlations rather than true causal relationships, leading to suboptimal performance, especially in out-of-distribution (OOD) scenarios. To address this challenge, we propose Causal Attention Tuning (CAT), a novel approach that injects fine-grained causal knowledge into the attention mechanism. We propose an automated pipeline that leverages human priors to automatically generate token-level causal signals and introduce the Re-Attention mechanism to guide training, helping the model focus on causal structures while mitigating noise and biases in attention scores. Experimental results on our proposed Spurious Token Game (STG) benchmark and multiple downstream tasks demonstrate that our approach effectively leverages causal knowledge for prediction and remains robust in OOD scenarios. The CAT achieves an average improvement of 5.76% on the STG dataset and 1.56% on downstream tasks. Notably, the OOD performance of the Llama-3.1-8B model on STG_M increased from 64.5% to 90.5%, and Qwen's OOD performance on the STG_H dataset improved from 25.4% to 55.9%. Implementation details can be found at https://github.com/Kairong-Han/CAT.
Chinese: 本研究提出因果注意力调优(CAT)方法,通过将细粒度因果知识注入注意力机制,显著提升大语言模型在分布外场景下的性能和鲁棒性,在STG基准测试及下游任务中均取得了明显改进。
English: The study introduces Causal Attention Tuning (CAT), a method that enhances large language models by integrating fine-grained causal knowledge into their attention mechanisms, significantly improving performance and robustness in out-of-distribution scenarios, as demonstrated by substantial gains on the STG benchmark and downstream tasks.

Authors:Liu Qifeng, Zhao Dawei, Dong Yabo, Xiao Liang, Wang Juan, Min Chen, Li Fuyang, Jiang Weizhong, Lu Dongming, Nie Yiming
Title: PointSlice: Accurate and Efficient Slice-Based Representation for 3D Object Detection from Point Clouds
Abstract:
3D object detection from point clouds plays a critical role in autonomous driving. Currently, the primary methods for point cloud processing are voxel-based and pillarbased approaches. Voxel-based methods offer high accuracy through fine-grained spatial segmentation but suffer from slower inference speeds. Pillar-based methods enhance inference speed but still fall short of voxel-based methods in accuracy. To address these issues, we propose a novel point cloud processing method, PointSlice, which slices point clouds along the horizontal plane and includes a dedicated detection network. The main contributions of PointSlice are: (1) A new point cloud processing technique that converts 3D point clouds into multiple sets of 2D (x-y) data slices. The model only learns 2D data distributions, treating the 3D point cloud as separate batches of 2D data, which reduces the number of model parameters and enhances inference speed; (2) The introduction of a Slice Interaction Network (SIN). To maintain vertical relationships across slices, we incorporate SIN into the 2D backbone network, which improves the model's 3D object perception capability. Extensive experiments demonstrate that PointSlice achieves high detection accuracy and inference speed. On the Waymo dataset, PointSlice is 1.13x faster and has 0.79x fewer parameters than the state-of-the-art voxel-based method (SAFDNet), with only a 1.2 mAPH accuracy reduction. On the nuScenes dataset, we achieve a state-of-the-art detection result of 66.74 mAP. On the Argoverse 2 dataset, PointSlice is 1.10x faster, with 0.66x fewer parameters and a 1.0 mAP accuracy reduction. The code will be available at https://github.com/qifeng22/PointSlice2.
中文: PointSlice提出了一种创新的点云处理方法,通过将三维点云转换为二维切片并结合切片交互网络,在保持高精度的同时显著提升了推理速度,在多个自动驾驶数据集中表现优异。
English: PointSlice introduces a novel 3D object detection method that converts point clouds into 2D slices and employs a Slice Interaction Network to balance high accuracy with improved inference speed across multiple autonomous driving datasets.

Authors:Mo Wang, Kaining Peng, Jingsheng Tang, Hongkai Wen, Quanying Liu
Title: DCA: Graph-Guided Deep Embedding Clustering for Brain Atlases
Abstract:
Brain atlases are essential for reducing the dimensionality of neuroimaging data and enabling interpretable analysis. However, most existing atlases are predefined, group-level templates with limited flexibility and resolution. We present Deep Cluster Atlas (DCA), a graph-guided deep embedding clustering framework for generating individualized, voxel-wise brain parcellations. DCA combines a pretrained autoencoder with spatially regularized deep clustering to produce functionally coherent and spatially contiguous regions. Our method supports flexible control over resolution and anatomical scope, and generalizes to arbitrary brain structures. We further introduce a standardized benchmarking platform for atlas evaluation, using multiple large-scale fMRI datasets. Across multiple datasets and scales, DCA outperforms state-of-the-art atlases, improving functional homogeneity by 98.8% and silhouette coefficient by 29%, and achieves superior performance in downstream tasks such as autism diagnosis and cognitive decoding. We also observe that a fine-tuned pretrained model achieves superior results on the corresponding task. Codes and models are available at https://github.com/ncclab-sustech/DCA .
中文: DCA是一种通过深度聚类生成个性化高分辨率脑区图谱的新框架,在功能同质性和疾病诊断等下游任务中显著优于现有方法。
English: DCA is a novel brain atlas framework that generates individualized, high-resolution parcellations using deep clustering, significantly outperforming existing methods in functional coherence and diagnostic applications.

Authors:Yang Liu, Masahiro Kaneko, Chenhui Chu
Title: On the Alignment of Large Language Models with Global Human Opinion
Abstract:
Today's large language models (LLMs) are capable of supporting multilingual scenarios, allowing users to interact with LLMs in their native languages. When LLMs respond to subjective questions posed by users, they are expected to align with the views of specific demographic groups or historical periods, shaped by the language in which the user interacts with the model. Existing studies mainly focus on researching the opinions represented by LLMs among demographic groups in the United States or a few countries, lacking worldwide country samples and studies on human opinions in different historical periods, as well as lacking discussion on using language to steer LLMs. Moreover, they also overlook the potential influence of prompt language on the alignment of LLMs' opinions. In this study, our goal is to fill these gaps. To this end, we create an evaluation framework based on the World Values Survey (WVS) to systematically assess the alignment of LLMs with human opinions across different countries, languages, and historical periods around the world. We find that LLMs appropriately or over-align the opinions with only a few countries while under-aligning the opinions with most countries. Furthermore, changing the language of the prompt to match the language used in the questionnaire can effectively steer LLMs to align with the opinions of the corresponding country more effectively than existing steering methods. At the same time, LLMs are more aligned with the opinions of the contemporary population. To our knowledge, our study is the first comprehensive investigation of the topic of opinion alignment in LLMs across global, language, and temporal dimensions. Our code and data are publicly available at https://github.com/nlply/global-opinion-alignment.
Chinese: 本研究通过建立基于世界价值观调查的评估框架,首次全面考察大型语言模型在全球、语言和时间维度上与人类观点的对齐情况,发现模型仅与少数国家观点过度对齐,但通过匹配提示语言可有效引导其与对应国家观点对齐,且更符合当代人群观点。
English: This study addresses the gap in evaluating large language models' alignment with human opinions across global, linguistic, and historical dimensions by creating a World Values Survey-based framework, revealing that models over-align with few countries but can be effectively steered using matching prompt languages while showing stronger alignment with contemporary views.

Authors:Wei Lu, Lingyu Zhu, Si-Bao Chen
Title: Unsupervised Ultra-High-Resolution UAV Low-Light Image Enhancement: A Benchmark, Metric and Framework
Abstract:
Low light conditions significantly degrade Unmanned Aerial Vehicles (UAVs) performance in critical applications. Existing Low-light Image Enhancement (LIE) methods struggle with the unique challenges of aerial imagery, including Ultra-High Resolution (UHR), lack of paired data, severe non-uniform illumination, and deployment constraints. To address these issues, we propose three key contributions. First, we present U3D, the first unsupervised UHR UAV dataset for LIE, with a unified evaluation toolkit. Second, we introduce the Edge Efficiency Index (EEI), a novel metric balancing perceptual quality with key deployment factors: speed, resolution, model complexity, and memory footprint. Third, we develop U3LIE, an efficient framework with two training-only designs-Adaptive Pre-enhancement Augmentation (APA) for input normalization and a Luminance Interval Loss (L_int) for exposure control. U3LIE achieves SOTA results, processing 4K images at 23.8 FPS on a single GPU, making it ideal for real-time on-board deployment. In summary, these contributions provide a holistic solution (dataset, metric, and method) for advancing robust 24/7 UAV vision. The code and datasets are available at https://github.com/lwCVer/U3D_Toolkit.
中文摘要:该研究提出U3D无监督超高清数据集、边缘效率指数指标及U3LIE框架,有效提升无人机低光图像性能,实现4K实时处理,为全天候无人机视觉提供完整解决方案。
English Summary: The study introduces U3D, an unsupervised ultra-high-resolution dataset, the Edge Efficiency Index metric, and the U3LIE framework to enhance low-light UAV image performance, achieving real-time 4K processing for robust 24/7 aerial vision.

Authors:Jiayi Gao, Changcheng Hua, Qingchao Chen, Yuxin Peng, Yang Liu
Title: Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement
Abstract:
Identity-preserving text-to-video (IPT2V) generation creates videos faithful to both a reference subject image and a text prompt. While fine-tuning large pretrained video diffusion models on ID-matched data achieves state-of-the-art results on IPT2V, data scarcity and high tuning costs hinder broader improvement. We thus introduce a Training-Free Prompt, Image, and Guidance Enhancement (TPIGE) framework that bridges the semantic gap between the video description and the reference image and design sampling guidance that enhances identity preservation and video quality, achieving performance gains at minimal cost.Specifically, we first propose Face Aware Prompt Enhancement, using GPT-4o to enhance the text prompt with facial details derived from the reference image. We then propose Prompt Aware Reference Image Enhancement, leveraging an identity-preserving image generator to refine the reference image, rectifying conflicts with the text prompt. The above mutual refinement significantly improves input quality before video generation. Finally, we propose ID-Aware Spatiotemporal Guidance Enhancement, utilizing unified gradients to optimize identity preservation and video quality jointly during generation.Our method outperforms prior work and is validated by automatic and human evaluations on a 1000 video test set, winning first place in the ACM Multimedia 2025 Identity-Preserving Video Generation Challenge, demonstrating state-of-the-art performance and strong generality. The code is available at https://github.com/Andyplus1/IPT2V.git.
中文: TPIGE框架通过GPT-4o优化文本提示和图像生成器增强参考图像,在视频生成过程中采用时空引导技术,无需额外训练即可实现最先进的身份保持文本到视频生成效果。
English: The TPIGE framework enhances identity-preserving text-to-video generation by refining prompts and reference images with GPT-4o and an image generator, then applying spatiotemporal guidance during sampling to achieve state-of-the-art results without additional training.

Authors:Qianrui Zhou, Hua Xu, Yifan Wang, Xinzhi Dong, Hanlei Zhang
Title: LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition
Abstract:
Understanding human intents from multimodal signals is critical for analyzing human behaviors and enhancing human-machine interactions in real-world scenarios. However, existing methods exhibit limitations in their modality-level reliance, constraining relational reasoning over fine-grained semantics for complex intent understanding. This paper proposes a novel LLM-Guided Semantic Relational Reasoning (LGSRR) method, which harnesses the expansive knowledge of large language models (LLMs) to establish semantic foundations that boost smaller models' relational reasoning performance. Specifically, an LLM-based strategy is proposed to extract fine-grained semantics as guidance for subsequent reasoning, driven by a shallow-to-deep Chain-of-Thought (CoT) that autonomously uncovers, describes, and ranks semantic cues by their importance without relying on manually defined priors. Besides, we formally model three fundamental types of semantic relations grounded in logical principles and analyze their nuanced interplay to enable more effective relational reasoning. Extensive experiments on multimodal intent and dialogue act recognition tasks demonstrate LGSRR's superiority over state-of-the-art methods, with consistent performance gains across diverse semantic understanding scenarios. The complete data and code are available at https://github.com/thuiar/LGSRR.
中文: 本文提出的LGSRR方法利用大语言模型增强多模态意图理解中的语义关系推理,通过链式思维自主提取细粒度语义线索,在多项识别任务中展现出优于现有方法的性能。
English: This paper introduces the LLM-Guided Semantic Relational Reasoning (LGSRR) method, which leverages large language models to enhance relational reasoning for complex multimodal intent understanding, achieving superior performance in recognition tasks without manual priors.

Authors:Oussama Messai, Abbass Zein-Eddine, Abdelouahid Bentamou, Mickaël Picq, Nicolas Duquesne, Stéphane Puydarrieux, Yann Gavet
Title: Image Quality Enhancement and Detection of Small and Dense Objects in Industrial Recycling Processes
Abstract:
This paper tackles two key challenges: detecting small, dense, and overlapping objects (a major hurdle in computer vision) and improving the quality of noisy images, especially those encountered in industrial environments. [1, 2]. Our focus is on evaluating methods built on supervised deep learning. We perform an analysis of these methods, using a newly developed dataset comprising over 10k images and 120k instances. By evaluating their performance, accuracy, and computational efficiency, we identify the most reliable detection systems and highlight the specific challenges they address in industrial applications. This paper also examines the use of deep learning models to improve image quality in noisy industrial environments. We introduce a lightweight model based on a fully connected convolutional network. Additionally, we suggest potential future directions for further enhancing the effectiveness of the model. The repository of the dataset and proposed model can be found at: https://github.com/o-messai/SDOOD, https://github.com/o-messai/DDSRNet
本文针对工业环境中检测小而密集、重叠物体及提升噪声图像质量的问题,采用监督式深度学习方法,通过新数据集评估性能并提出轻量级模型以改善图像质量。
This paper addresses the detection of small, dense, and overlapping objects and the enhancement of noisy images in industrial settings using supervised deep learning, evaluating methods on a new dataset and proposing a lightweight model for improved image quality.

Authors:Meituan LongCat Team, Bayan, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, Chengcheng Han, Chenguang Xi, Chi Zhang, Chong Peng, Chuan Qin, Chuyu Zhang, Cong Chen, Congkui Wang, Dan Ma, Daoru Pan, Defei Bu, Dengchang Zhao, Deyang Kong, Dishan Liu, Feiye Huo, Fengcun Li, Fubao Zhang, Gan Dong, Gang Liu, Gang Xu, Ge Li, Guoqiang Tan, Guoyuan Lin, Haihang Jing, Haomin Fu, Haonan Yan, Haoxing Wen, Haozhe Zhao, Hong Liu, Hongmei Shi, Hongyan Hao, Hongyin Tang, Huantian Lv, Hui Su, Jiacheng Li, Jiahao Liu, Jiahuan Li, Jiajun Yang, Jiaming Wang, Jian Yang, Jianchao Tan, Jiaqi Sun, Jiaqi Zhang, Jiawei Fu, Jiawei Yang, Jiaxi Hu, Jiayu Qin, Jingang Wang, Jiyuan He, Jun Kuang, Junhui Mei, Kai Liang, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Liang Gao, Liang Shi, Lianhui Ma, Lin Qiu, Lingbin Kong, Lingtong Si, Linkun Lyu, Linsen Guo, Liqi Yang, Lizhi Yan, Mai Xia, Man Gao, Manyuan Zhang, Meng Zhou, Mengxia Shen, Mingxiang Tuo, Mingyang Zhu, Peiguang Li, Peng Pei, Peng Zhao, Pengcheng Jia, Pingwei Sun, Qi Gu, Qianyun Li, Qingyuan Li, Qiong Huang, Qiyuan Duan, Ran Meng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shizhe Wu, Shuai Liang, Shuo Wang, Suogui Dang, Tao Fang, Tao Li, Tefeng Chen, Tianhao Bai, Tianhao Zhou, Tingwen Xie, Wei He, Wei Huang, Wei Liu, Wei Shi, Wei Wang, Wei Wu, Weikang Zhao, Wen Zan, Wenjie Shi, Xi Nan, Xi Su, Xiang Li, Xiang Mei, Xiangyang Ji, Xiangyu Xi, Xiangzhou Huang, Xianpeng Li, Xiao Fu, Xiao Liu, Xiao Wei, Xiaodong Cai, Xiaolong Chen, Xiaoqing Liu, Xiaotong Li, Xiaowei Shi, Xiaoyu Li, Xili Wang, Xin Chen, Xing Hu, Xingyu Miao, Xinyan He, Xuemiao Zhang, Xueyuan Hao, Xuezhi Cao, Xunliang Cai, Xurui Yang, Yan Feng, Yang Bai, Yang Chen, Yang Yang, Yaqi Huo, Yerui Sun, Yifan Lu, Yifan Zhang, Yipeng Zang, Yitao Zhai, Yiyang Li, Yongjing Yin, Yongkang Lv, Yongwei Zhou, Yu Yang, Yuchen Xie, Yueqing Sun, Yuewen Zheng, Yuhuai Wei, Yulei Qian, Yunfan Liang, Yunfang Tai, Yunke Zhao, Zeyang Yu, Zhao Zhang, Zhaohua Yang, Zhenchao Zhang, Zhikang Xia, Zhiye Zou, Zhizhao Zeng, Zhongda Su, Zhuofan Chen, Zijian Zhang, Ziwen Wang, Zixu Jiang, Zizhe Zhao, Zongyu Wang, Zunhai Su
Title: LongCat-Flash Technical Report
Abstract:
We introduce LongCat-Flash, a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Stemming from the need for scalable efficiency, LongCat-Flash adopts two novel designs: (a) Zero-computation Experts, which enables dynamic computational budget allocation and activates 18.6B-31.3B (27B on average) per token depending on contextual demands, optimizing resource usage. (b) Shortcut-connected MoE, which enlarges the computation-communication overlap window, demonstrating notable gains in inference efficiency and throughput compared to models of a comparable scale. We develop a comprehensive scaling framework for large models that combines hyperparameter transfer, model-growth initialization, a multi-pronged stability suite, and deterministic computation to achieve stable and reproducible training. Notably, leveraging the synergy among scalable architectural design and infrastructure efforts, we complete model training on more than 20 trillion tokens within 30 days, while achieving over 100 tokens per second (TPS) for inference at a cost of \$0.70 per million output tokens. To cultivate LongCat-Flash towards agentic intelligence, we conduct a large-scale pre-training on optimized mixtures, followed by targeted mid- and post-training on reasoning, code, and instructions, with further augmentation from synthetic data and tool use tasks. Comprehensive evaluations demonstrate that, as a non-thinking foundation model, LongCat-Flash delivers highly competitive performance among other leading models, with exceptional strengths in agentic tasks. The model checkpoint of LongCat-Flash is open-sourced to foster community research. LongCat Chat: https://longcat.ai Hugging Face: https://huggingface.co/meituan-longcat GitHub: https://github.com/meituan-longcat
中文: LongCat-Flash 是一个拥有5600亿参数的专家混合模型,通过零计算专家和捷径连接MoE等创新设计实现高效计算,在20万亿令牌上快速完成训练,在智能体任务中表现优异,模型已开源供社区研究。
English: LongCat-Flash is a 560-billion-parameter Mixture-of-Experts model that achieves computational efficiency through novel designs like Zero-computation Experts and Shortcut-connected MoE, enabling rapid training on 20+ trillion tokens and demonstrating strong performance in agentic tasks while being open-sourced for community use.

Authors:Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Lam-Huy Nguyen, Minh-Triet Tran, Trung-Nghia Le
Title: ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
Abstract:
Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap's effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. The code is available at https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.
Chinese: ReCap提出了一种事件增强的图像描述生成流程,通过整合文章上下文信息来生成富含叙述性的描述,在EVENTA 2025挑战赛中排名第二,有效解决了传统模型缺乏语境感知的问题。
English: ReCap introduces an event-enriched image captioning pipeline that integrates contextual information from articles to generate narrative-rich captions, addressing the limitations of standard models by ranking 2nd in the EVENTA 2025 challenge.

Authors:Xiangdong Zhang, Shaofeng Zhang, Junchi Yan
Title: Towards More Diverse and Challenging Pre-training for Point Cloud Learning: Self-Supervised Cross Reconstruction with Decoupled Views
Abstract:
Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. Recognizing that a two-view pre-training paradigm inherently introduces greater diversity and variance, it may thus enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% in three variants of ScanObjectNN with the Mlp-Linear evaluation protocol. The code is available at https://github.com/aHapBean/Point-PQAE.
中文摘要:本文提出Point-PQAE双视角交叉重建方法,通过解耦点云生成双视角并设计新型位置编码,在自监督点云学习中实现了比单视角方法更具挑战性的预训练任务,显著提升了性能表现。
English Summary: This paper introduces Point-PQAE, a two-view cross-reconstruction method for self-supervised point cloud learning that surpasses single-view approaches by generating more challenging pre-training tasks through decoupled views and novel positional encoding.

Authors:Yusheng Zheng, Yanpeng Hu, Wei Zhang, Andi Quinn
Title: Towards Agentic OS: An LLM Agent Framework for Linux Schedulers
Abstract:
Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application-specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI's role of semantic reasoning ("what to optimize") from the system's role of execution ("how to observe and act"). Implemented as Model Context Protocol(MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configure before deployment with static and dynamic analysis. We demonstrate this architecture's power with sched-agent, a multi-agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched\_ext infrastructure. Our evaluation shows that SchedCP achieves up to an 1.79x performance improvement, and a 13x cost reduction compared to naive agentic approaches, all while maintaining high success rate. By bridging the semantic gap, SchedCP democratizes expert-level system optimization and represents a step towards creating truly self-optimizing, application-aware operating systems. The code is open-sourced in https://github.com/eunomia-bpf/schedcp
中文:SchedCP 是一种创新框架,通过分离语义推理与执行,利用自主大型语言模型代理优化 Linux 调度器,在严格验证保障安全的同时,实现了性能显著提升和成本大幅降低。
English: SchedCP is a novel framework that employs autonomous LLM agents to optimize Linux schedulers by decoupling semantic reasoning from execution, achieving significant performance gains and cost reductions while ensuring safety through rigorous verification.

Authors:Yusheng Zheng, Yanpeng Hu, Wei Zhang, Andi Quinn
Title: Towards Agentic OS: An LLM Agent Framework for Linux Schedulers
Abstract:
Operating system schedulers suffer from a fundamental semantic gap, where kernel policies fail to understand application-specific needs, leading to suboptimal performance. We introduce SchedCP, the first framework that enables fully autonomous Large Language Model (LLM) agents to safely and efficiently optimize Linux schedulers without human involvement. Our core insight is that the challenge is not merely to apply a better LLM, but to architect a decoupled control plane that separates the AI's role of semantic reasoning ("what to optimize") from the system's role of execution ("how to observe and act"), thereby separating the optimization problem into two stages: goal-inference and policy-synthesis. Implemented as Model Context Protocol(MCP) server, SchedCP provides a stable interface with three key services: a Workload Analysis Engine, an evolving Scheduler Policy Repository, and an Execution Verifier that validates all AI-generated code and configure before deployment with static and dynamic analysis. We demonstrate this architecture's power with sched-agent, a multi-agent system that autonomously analyzes workloads, synthesizes custom eBPF scheduling policies, and deploys them via the sched\_ext infrastructure. Our evaluation shows that SchedCP achieves up to an 1.79x performance improvement, and a 13x cost reduction compared to naive agentic approaches, all while maintaining high success rate. By bridging the semantic gap, SchedCP democratizes expert-level system optimization and represents a step towards creating truly self-optimizing, application-aware operating systems. The code is open-sourced in https://github.com/eunomia-bpf/schedcp
中文:SchedCP 是一种创新框架,通过分离语义推理与执行,利用自主大型语言模型代理优化 Linux 调度器,在严格验证保障安全的同时,实现了性能显著提升和成本大幅降低。
English: SchedCP is a novel framework that employs autonomous LLM agents to optimize Linux schedulers by decoupling semantic reasoning from execution, achieving significant performance gains and cost reductions while ensuring safety through rigorous verification.

Authors:Yuan Liu, Zhongyin Zhao, Le Tian, Haicheng Wang, Xubing Ye, Yangxiu You, Zilin Yu, Chuhan Wu, Xiao Zhou, Yang Yu, Jie Zhou
Title: POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Abstract:
High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model's conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model is available at https://github.com/Tencent/POINTS-Reader.
中文: 本文提出了一种全自动的两阶段框架,首先生成多样化的合成数据训练文档提取模型,然后通过自改进技术在真实文档上迭代优化,最终开发出性能卓越的POINTS-Reader模型,其表现优于许多现有解决方案。
English: This paper introduces a fully automated, two-stage framework that first generates diverse synthetic data to train a model for document extraction and then iteratively refines it using self-improvement techniques on real documents, resulting in the high-performing POINTS-Reader model that surpasses many existing solutions.

Authors:Tianwei Ye, Yong Ma, Xiaoguang Mei
Title: DcMatch: Unsupervised Multi-Shape Matching with Dual-Level Consistency
Abstract:
Establishing point-to-point correspondences across multiple 3D shapes is a fundamental problem in computer vision and graphics. In this paper, we introduce DcMatch, a novel unsupervised learning framework for non-rigid multi-shape matching. Unlike existing methods that learn a canonical embedding from a single shape, our approach leverages a shape graph attention network to capture the underlying manifold structure of the entire shape collection. This enables the construction of a more expressive and robust shared latent space, leading to more consistent shape-to-universe correspondences via a universe predictor. Simultaneously, we represent these correspondences in both the spatial and spectral domains and enforce their alignment in the shared universe space through a novel cycle consistency loss. This dual-level consistency fosters more accurate and coherent mappings. Extensive experiments on several challenging benchmarks demonstrate that our method consistently outperforms previous state-of-the-art approaches across diverse multi-shape matching scenarios. Code is available at https://github.com/YeTianwei/DcMatch.
中文: DcMatch是一种无监督学习框架,通过形状图注意力网络和双域循环一致性实现鲁棒的多形状匹配,在多个基准测试中超越了现有最优方法。
English: DcMatch is an unsupervised learning framework that utilizes a shape graph attention network and dual-domain cycle consistency to achieve robust and consistent multi-shape matching, outperforming previous state-of-the-art methods.

Authors:Xiaoran Yang, Yuyang Du, Kexin Chen, Soung Chang Liew, Jiamin Lu, Ziyu Guo, Xiaoyan Liu, Qun Yang, Shiqi Xu, Xingyu Fan, Yuchen Pan, Taoyong Cui, Hongyu Deng, Boris Dudder, Jianzhang Pan, Qun Fang, Pheng Ann Heng
Title: IndusGCC: A Data Benchmark and Evaluation Framework for GUI-Based General Computer Control in Industrial Automation
Abstract:
As Industry 4.0 progresses, flexible manufacturing has become a cornerstone of modern industrial systems, with equipment automation playing a pivotal role. However, existing control software for industrial equipment, typically reliant on graphical user interfaces (GUIs) that require human interactions such as mouse clicks or screen touches, poses significant barriers to the adoption of code-based equipment automation. Recently, Large Language Model-based General Computer Control (LLM-GCC) has emerged as a promising approach to automate GUI-based operations. However, industrial settings pose unique challenges, including visually diverse, domain-specific interfaces and mission-critical tasks demanding high precision. This paper introduces IndusGCC, the first dataset and benchmark tailored to LLM-GCC in industrial environments, encompassing 448 real-world tasks across seven domains, from robotic arm control to production line configuration. IndusGCC features multimodal human interaction data with the equipment software, providing robust supervision for GUI-level code generation. Additionally, we propose a novel evaluation framework with functional and structural metrics to assess LLM-generated control scripts. Experimental results on mainstream LLMs demonstrate both the potential of LLM-GCC and the challenges it faces, establishing a strong foundation for future research toward fully automated factories. Our data and code are publicly available at: \href{https://github.com/Golden-Arc/IndustrialLLM}{https://github.com/Golden-Arc/IndustrialLLM.
Chinese Summary: 本文提出了首个针对工业环境中大语言模型通用计算机控制的数据集和基准IndusGCC,通过涵盖7个领域的448个真实任务及新型评估框架,解决了领域特定界面和高精度要求等独特挑战。
English Summary: This paper introduces IndusGCC, the first dataset and benchmark for Large Language Model-based General Computer Control in industrial settings, addressing unique challenges like domain-specific interfaces and high-precision requirements through 448 real-world tasks and a novel evaluation framework.

Authors:Yutian Xiao, Shukuan Wang, Binhao Wang, Zhao Zhang, Yanze Zhang, Shanqi Liu, Chao Feng, Xiang Li, Fuzhen Zhuang
Title: MARS: Modality-Aligned Retrieval for Sequence Augmented CTR Prediction
Abstract:
Click-through rate (CTR) prediction serves as a cornerstone of recommender systems. Despite the strong performance of current CTR models based on user behavior modeling, they are still severely limited by interaction sparsity, especially in low-active user scenarios. To address this issue, data augmentation of user behavior is a promising research direction. However, existing data augmentation methods heavily rely on collaborative signals while overlooking the rich multimodal features of items, leading to insufficient modeling of low-active users. To alleviate this problem, we propose a novel framework \textbf{MARS} (\textbf{M}odality-\textbf{A}ligned \textbf{R}etrieval for \textbf{S}equence Augmented CTR Prediction). MARS utilizes a Stein kernel-based approach to align text and image features into a unified and unbiased semantic space to construct multimodal user embeddings. Subsequently, each low-active user's behavior sequence is augmented by retrieving, filtering, and concentrating the most similar behavior sequence of high-active users via multimodal user embeddings. Validated by extensive offline experiments and online A/B tests, our framework MARS consistently outperforms state-of-the-art baselines and achieves substantial growth on core business metrics within Kuaishou~\footnote{https://www.kuaishou.com/}. Consequently, MARS has been successfully deployed, serving the main traffic for hundreds of millions of users. To ensure reproducibility, we provide anonymous access to the implementation code~\footnote{https://github.com/wangshukuan/MARS}.
中文摘要:MARS框架通过对齐多模态商品特征来增强低活跃用户行为序列,有效解决了CTR预测中的交互稀疏性问题,并在离线和在线测试中均展现出卓越性能。
English Summary: The MARS framework addresses interaction sparsity in CTR prediction by aligning multimodal item features to augment low-active user behavior sequences, demonstrating superior performance in both offline and online tests.

Authors:Bingnan Yang, Mi Zhang, Zhili Zhang, Zhan Zhang, Yuanxin Zhao, Xiangyun Hu, Jianya Gong
Title: SegAssess: Panoramic quality mapping for robust and transferable unsupervised segmentation assessment
Abstract:
High-quality image segmentation is fundamental to pixel-level geospatial analysis in remote sensing, necessitating robust segmentation quality assessment (SQA), particularly in unsupervised settings lacking ground truth. Although recent deep learning (DL) based unsupervised SQA methods show potential, they often suffer from coarse evaluation granularity, incomplete assessments, and poor transferability. To overcome these limitations, this paper introduces Panoramic Quality Mapping (PQM) as a new paradigm for comprehensive, pixel-wise SQA, and presents SegAssess, a novel deep learning framework realizing this approach. SegAssess distinctively formulates SQA as a fine-grained, four-class panoramic segmentation task, classifying pixels within a segmentation mask under evaluation into true positive (TP), false positive (FP), true negative (TN), and false negative (FN) categories, thereby generating a complete quality map. Leveraging an enhanced Segment Anything Model (SAM) architecture, SegAssess uniquely employs the input mask as a prompt for effective feature integration via cross-attention. Key innovations include an Edge Guided Compaction (EGC) branch with an Aggregated Semantic Filter (ASF) module to refine predictions near challenging object edges, and an Augmented Mixup Sampling (AMS) training strategy integrating multi-source masks to significantly boost cross-domain robustness and zero-shot transferability. Comprehensive experiments across 32 datasets derived from 6 sources demonstrate that SegAssess achieves state-of-the-art (SOTA) performance and exhibits remarkable zero-shot transferability to unseen masks, establishing PQM via SegAssess as a robust and transferable solution for unsupervised SQA. The code is available at https://github.com/Yangbn97/SegAssess.
中文: 本文提出SegAssess框架,通过全景质量映射实现像素级分割质量评估,在跨域测试中展现出卓越的零样本迁移能力,成为无监督分割质量评估的突破性解决方案。
English: This paper introduces SegAssess, a novel deep learning framework implementing Panoramic Quality Mapping (PQM) for comprehensive pixel-level segmentation quality assessment, which achieves state-of-the-art performance and exceptional zero-shot transferability across diverse datasets.

Authors:Yun Chu, Qiuhao Wang, Enze Zhou, Qian Liu, Gang Zheng
Title: EZhouNet:A framework based on graph neural network and anchor interval for the respiratory sound event detection
Abstract:
Auscultation is a key method for early diagnosis of respiratory and pulmonary diseases, relying on skilled healthcare professionals. However, the process is often subjective, with variability between experts. As a result, numerous deep learning-based automatic classification methods have emerged, most of which focus on respiratory sound classification. In contrast, research on respiratory sound event detection remains limited. Existing sound event detection methods typically rely on frame-level predictions followed by post-processing to generate event-level outputs, making interval boundaries challenging to learn directly. Furthermore, many approaches can only handle fixed-length audio, limiting their applicability to variable-length respiratory sounds. Additionally, the impact of respiratory sound location information on detection performance has not been extensively explored. To address these issues, we propose a graph neural network-based framework with anchor intervals, capable of handling variable-length audio and providing more precise temporal localization for abnormal respiratory sound events. Our method improves both the flexibility and applicability of respiratory sound detection. Experiments on the SPRSound 2024 and HF Lung V1 datasets demonstrate the effectiveness of the proposed approach, and incorporating respiratory position information enhances the discrimination between abnormal sounds. The reference implementation is available at https://github.com/chumingqian/EzhouNet.
中文: 本文提出了一种基于图神经网络和锚定区间的框架,用于呼吸音事件检测,能处理变长音频并提供精确的时间定位,在SPRSound 2024和HF Lung V1数据集上验证了其有效性。
English: This paper introduces a graph neural network framework with anchor intervals to improve respiratory sound event detection by handling variable-length audio and providing precise temporal localization, validated on SPRSound 2024 and HF Lung V1 datasets.

Authors:Weiren Zhao, Lanfeng Zhong, Xin Liao, Wenjun Liao, Sichuan Zhang, Shaoting Zhang, Guotai Wang
Title: MetaSSL: A General Heterogeneous Loss for Semi-Supervised Medical Image Segmentation
Abstract:
Semi-Supervised Learning (SSL) is important for reducing the annotation cost for medical image segmentation models. State-of-the-art SSL methods such as Mean Teacher, FixMatch and Cross Pseudo Supervision (CPS) are mainly based on consistency regularization or pseudo-label supervision between a reference prediction and a supervised prediction. Despite the effectiveness, they have overlooked the potential noise in the labeled data, and mainly focus on strategies to generate the reference prediction, while ignoring the heterogeneous values of different unlabeled pixels. We argue that effectively mining the rich information contained by the two predictions in the loss function, instead of the specific strategy to obtain a reference prediction, is more essential for SSL, and propose a universal framework MetaSSL based on a spatially heterogeneous loss that assigns different weights to pixels by simultaneously leveraging the uncertainty and consistency information between the reference and supervised predictions. Specifically, we split the predictions on unlabeled data into four regions with decreasing weights in the loss: Unanimous and Confident (UC), Unanimous and Suspicious (US), Discrepant and Confident (DC), and Discrepant and Suspicious (DS), where an adaptive threshold is proposed to distinguish confident predictions from suspicious ones. The heterogeneous loss is also applied to labeled images for robust learning considering the potential annotation noise. Our method is plug-and-play and general to most existing SSL methods. The experimental results showed that it improved the segmentation performance significantly when integrated with existing SSL frameworks on different datasets. Code is available at https://github.com/HiLab-git/MetaSSL.
中文: 提出的MetaSSL框架通过引入空间异质性损失,根据不确定性和一致性为像素分配权重,有效处理标注数据中的噪声,显著提升了多种半监督学习方法在医学图像分割中的性能。
English: The proposed MetaSSL framework enhances semi-supervised medical image segmentation by introducing a spatially heterogeneous loss that assigns pixel-level weights based on uncertainty and consistency, effectively addressing noise in labeled data and improving performance across various SSL methods.

Authors:Guangli Li, Canbiao Wu, Zhehao Zhou, Na Tian, Zhen Liang
Title: MATL-DC: A Multi-domain Aggregation Transfer Learning Framework for EEG Emotion Recognition with Domain-Class Prototype under Unseen Targets
Abstract:
Emotion recognition based on electroencephalography (EEG) signals is increasingly becoming a key research hotspot in affective Brain-Computer Interfaces (aBCIs). However, the current transfer learning model greatly depends on the source domain and target domain data, which hinder the practical application of emotion recognition. Therefore, we propose a Multi-domain Aggregation Transfer Learning framework for EEG emotion recognition with Domain-Class prototype under unseen targets (MATL-DC). We design the feature decoupling module to decouple class-invariant domain features from domain-invariant class features from shallow features. In the model training stage, the multi-domain aggregation mechanism aggregates the domain feature space to form a superdomain, which enhances the characteristics of emotional EEG signals. In each superdomain, we further extract the class prototype representation by class features. In addition, we adopt the pairwise learning strategy to transform the sample classification problem into the similarity problem between sample pairs, which effectively alleviates the influence of label noise. It is worth noting that the target domain is completely unseen during the training process. In the inference stage, we use the trained domain-class prototypes for inference, and then realize emotion recognition. We rigorously validate it on the publicly available databases (SEED, SEED-IV and SEED-V). The results show that the accuracy of MATL-DC model is 84.70\%, 68.11\% and 61.08\%, respectively. MATL-DC achieves comparable or even better performance than methods that rely on both source and target domains. The source code is available at https://github.com/WuCB-BCI/MATL-DC.
中文: 提出的MATL-DC框架通过多域聚合和域类原型处理未见目标域,在训练阶段无需目标域数据的情况下,实现了具有竞争力的脑电情绪识别准确率。
English: The proposed MATL-DC framework advances EEG-based emotion recognition by using multi-domain aggregation and domain-class prototypes to handle unseen target domains, achieving competitive accuracy without requiring target data during training.

Authors:Zhengqiang Zhang, Rongyuan Wu, Lingchen Sun, Lei Zhang
Title: GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation
Abstract:
Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at $\href{https://github.com/xtudbxk/GPSToken}{https://github.com/xtudbxk/GPSToken}$.
Chinese: GPSToken提出了一种高斯参数化的空间自适应标记框架,利用二维高斯动态建模图像区域,实现非均匀标记,并在图像重建和生成任务中取得了最先进的性能。
English: GPSToken introduces a Gaussian parameterized spatially-adaptive tokenization framework that uses 2D Gaussians to dynamically model image regions, enabling non-uniform tokenization and achieving state-of-the-art performance in image reconstruction and generation.

Authors:Shuangyuan Chen, Shuang Wei, Dongxing Xu, Yanhua Long
Title: Noisy Disentanglement with Tri-stage Training for Noise-Robust Speech Recognition
Abstract:
To enhance the performance of end-to-end (E2E) speech recognition systems in noisy or low signal-to-noise ratio (SNR) conditions, this paper introduces NoisyD-CT, a novel tri-stage training framework built on the Conformer-Transducer architecture. The core of NoisyD-CT is a especially designed compact noisy disentanglement (NoisyD) module (adding only 1.71M parameters), integrated between the Conformer blocks and Transducer Decoder to perform deep noise suppression and improve ASR robustness in challenging acoustic noise environments. To fully exploit the noise suppression capability of the NoisyD-CT, we further propose a clean representation consistency loss to align high-level representations derived from noisy speech with those obtained from corresponding clean speech. Together with a noisy reconstruction loss, this consistency alignment enables the NoisyD module to effectively suppress noise while preserving essential acoustic and linguistic features consistent across both clean and noisy conditions, thereby producing cleaner internal representations that enhance ASR performance. Moreover, our tri-stage training strategy is designed to fully leverage the functionalities of both the noisy disentanglement and speech recognition modules throughout the model training process, ultimately maximizing performance gains under noisy conditions. Our experiments are performed on the LibriSpeech and CHiME-4 datasets, extensive results demonstrate that our proposed NoisyD-CT significantly outperforms the competitive Conformer-Transducer baseline, achieving up to 25.7% and 10.6% relative word error rate reductions on simulated and real-world noisy test sets, respectively, while maintaining or even improving performance on clean speech test sets. The source code, model checkpoint and data simulation scripts will be available at https://github.com/litchimo/NoisyD-CT.
中文: 本文提出NoisyD-CT三阶段训练框架,通过紧凑的噪声解耦模块在噪声环境中实现深度噪声抑制并保留关键语音特征,在模拟和真实噪声测试集上显著降低词错误率,同时在纯净语音上保持或提升识别性能。
English: This paper introduces NoisyD-CT, a tri-stage training framework with a compact noisy disentanglement module that enhances end-to-end speech recognition robustness in noisy environments by suppressing noise while preserving essential features, achieving significant word error rate reductions on both simulated and real-world noisy datasets.

Authors:Abdessalam Bouchekif, Samer Rashwani, Heba Sbahi, Shahd Gaben, Mutaz Al-Khatib, Mohammed Ghaly
Title: Assessing Large Language Models on Islamic Legal Reasoning: Evidence from Inheritance Law Evaluation
Abstract:
This paper evaluates the knowledge and reasoning capabilities of Large Language Models in Islamic inheritance law, known as 'ilm al-mawarith. We assess the performance of seven LLMs using a benchmark of 1,000 multiple-choice questions covering diverse inheritance scenarios, designed to test models' ability to understand the inheritance context and compute the distribution of shares prescribed by Islamic jurisprudence. The results reveal a significant performance gap: o3 and Gemini 2.5 achieved accuracies above 90%, whereas ALLaM, Fanar, LLaMA, and Mistral scored below 50%. These disparities reflect important differences in reasoning ability and domain adaptation. We conduct a detailed error analysis to identify recurring failure patterns across models, including misunderstandings of inheritance scenarios, incorrect application of legal rules, and insufficient domain knowledge. Our findings highlight limitations in handling structured legal reasoning and suggest directions for improving performance in Islamic legal reasoning. Code: https://github.com/bouchekif/inheritance_evaluation
中文摘要:本研究评估了七种大型语言模型在伊斯兰继承法领域的表现,结果显示仅有两种模型准确率超过90%,而四种模型低于50%,错误分析揭示了模型在法律规则应用和领域知识方面存在关键推理缺陷。
English Summary: This study evaluates seven large language models on Islamic inheritance law, revealing a significant performance gap where only two models achieved over 90% accuracy while four scored below 50%, with error analysis identifying key reasoning failures in legal rule application and domain knowledge.

Authors:Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
Title: VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving near 2$\times$ speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
中文: VerlTool作为一个统一模块化框架,通过标准化API、异步执行和灵活插件架构,解决了现有强化学习方法在多轮工具交互中的效率瓶颈,在六大领域实现优异性能的同时大幅降低了开发门槛。
English: VerlTool is a unified modular framework that overcomes the limitations of existing reinforcement learning approaches by enabling efficient multi-turn tool interactions through standardized APIs, asynchronous execution, and a flexible plugin architecture, achieving competitive performance across six domains while accelerating development.

Authors:Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, Wenhu Chen
Title: VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated success in enhancing LLM reasoning capabilities, but remains limited to single-turn interactions without tool integration. While recent Agentic Reinforcement Learning with Tool use (ARLT) approaches have emerged to address multi-turn tool interactions, existing works develop task-specific codebases that suffer from fragmentation, synchronous execution bottlenecks, and limited extensibility across domains. These inefficiencies hinder broader community adoption and algorithmic innovation. We introduce VerlTool, a unified and modular framework that addresses these limitations through systematic design principles. VerlTool provides four key contributions: (1) upstream alignment with VeRL ensuring compatibility and simplified maintenance, (2) unified tool management via standardized APIs supporting diverse modalities including code execution, search, SQL databases, and vision processing, (3) asynchronous rollout execution achieving near 2$\times$ speedup by eliminating synchronization bottlenecks, and (4) comprehensive evaluation demonstrating competitive performance across 6 ARLT domains. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. We train and evaluate models on mathematical reasoning, knowledge QA, SQL generation, visual reasoning, web search, and software engineering tasks, achieving results comparable to specialized systems while providing unified training infrastructure. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions, significantly reducing development overhead and providing a scalable foundation for tool-augmented RL research. Our code is open-sourced at https://github.com/TIGER-AI-Lab/verl-tool.
中文: VerlTool作为一个统一模块化框架,通过标准化API、异步执行和灵活插件架构,解决了现有强化学习方法在多轮工具交互中的效率瓶颈,在六大领域实现优异性能的同时大幅降低了开发门槛。
English: VerlTool is a unified modular framework that overcomes the limitations of existing reinforcement learning approaches by enabling efficient multi-turn tool interactions through standardized APIs, asynchronous execution, and a flexible plugin architecture, achieving competitive performance across six domains while accelerating development.

Authors:Lun Ai, Johannes Langer, Ute Schmid, Stephen Muggleton
Title: Ultra Strong Machine Learning: Teaching Humans Active Learning Strategies via Automated AI Explanations
Abstract:
Ultra Strong Machine Learning (USML) refers to symbolic learning systems that not only improve their own performance but can also teach their acquired knowledge to quantifiably improve human performance. In this work, we present LENS (Logic Programming Explanation via Neural Summarisation), a neuro-symbolic method that combines symbolic program synthesis with large language models (LLMs) to automate the explanation of machine-learned logic programs in natural language. LENS addresses a key limitation of prior USML approaches by replacing hand-crafted explanation templates with scalable automated generation. Through systematic evaluation using multiple LLM judges and human validation, we demonstrate that LENS generates superior explanations compared to direct LLM prompting and hand-crafted templates. To investigate whether LENS can teach transferable active learning strategies, we carried out a human learning experiment across three related domains. Our results show no significant human performance improvements, suggesting that comprehensive LLM responses may overwhelm users for simpler problems rather than providing learning support. Our work provides a solid foundation for building effective USML systems to support human learning. The source code is available on: https://github.com/lun-ai/LENS.git.
中文: LENS是一种神经符号方法,能自动生成机器学习逻辑程序的自然语言解释,其效果优于传统模板和直接大语言模型提示,但在实验中未显著提升人类学习效果。
English: LENS is a neuro-symbolic method that automates the generation of natural language explanations for machine-learned logic programs, outperforming traditional templates and direct LLM prompting, though it did not significantly enhance human learning in experiments.

Authors:Yutong Gao, Maoyuan Shao, Xinyang Huang, Chuang Zhu, Lijuan Sun, Yu Weng, Xuan Liu, Guoshun Nan
Title: Spotlighter: Revisiting Prompt Tuning from a Representative Mining View
Abstract:
CLIP's success has demonstrated that prompt tuning can achieve robust cross-modal semantic alignment for tasks ranging from open-domain recognition to fine-grained classification. However, redundant or weakly relevant feature components introduce noise and incur unnecessary computational costs. In this work, we propose Spotlighter, a lightweight token-selection framework that simultaneously enhances accuracy and efficiency in prompt tuning. Spotlighter evaluates each visual token's activation from both sample-wise and semantic-wise perspectives and retains only the top-scoring tokens for downstream prediction. A class-specific semantic memory bank of learned prototypes refines this selection, ensuring semantic representativeness and compensating for discarded features. To further prioritize informative signals, we introduce a two-level ranking mechanism that dynamically weights token--prototype interactions. Across 11 few-shot benchmarks, Spotlighter outperforms CLIP by up to 11.19\% in harmonic mean accuracy and achieves up to 0.8K additional FPS, with only 21 extra parameters. These results establish Spotlighter as an effective and scalable baseline for prompt tuning. Code for our method will be available at https://github.com/greatest-gourmet/Spotlighter.
中文: Spotlighter是一种轻量级令牌选择框架,通过双重评估和语义记忆库保留高分视觉令牌,在提示调优中显著提升准确性与效率,成为该领域的有效基准。
English: Spotlighter is a lightweight token-selection framework that enhances prompt tuning by retaining top-scoring visual tokens through dual-perspective evaluation and a semantic memory bank, achieving superior accuracy and efficiency across benchmarks.

Authors:Yong Su, Yiyi Chen, Shenghong Yi, Hui Feng, Yuedong Xu, Wang Xiang, Bo Hu
Title: A Modular and Scalable Simulator for Connected-UAVs Communication in 5G Networks
Abstract:
Cellular-connected UAV systems have enabled a wide range of low-altitude aerial services. However, these systems still face many challenges, such as frequent handovers and the inefficiency of traditional transport protocols. To better study these issues, we develop a modular and scalable simulation platform specifically designed for UAVs communication leveraging the research ecology in wireless communication of MATLAB. The platform supports flexible 5G NR node deployment, customizable UAVs mobility models, and multi-network-interface extensions. It also supports multiple transport protocols including TCP, UDP, QUIC, etc., allowing to investigate how different transport protocols affect UAVs communication performance. In addition, the platform includes a handover management module, enabling the evaluation of both traditional and learning-based handover strategies. Our platform can serve as a testbed for the development and evaluation of advanced transmission strategies in cellular-connected UAV systems.
中文: 本文开发了一个基于MATLAB的模块化仿真平台,旨在解决蜂窝连接无人机系统中的频繁切换和传输协议效率低下等问题,支持灵活的5G NR部署、可定制的移动模型及多种协议评估。
English: This paper introduces a modular simulation platform built on MATLAB to address challenges in cellular-connected UAV systems, such as frequent handovers and inefficient transport protocols, by supporting flexible 5G NR deployment, customizable mobility models, and diverse protocol evaluations.

Authors:Shaina Raza, Maximus Powers, Partha Pratim Saha, Mahveen Raza, Rizwan Qureshi
Title: Prompting Away Stereotypes? Evaluating Bias in Text-to-Image Models for Occupations
Abstract:
Text-to-Image (TTI) models are powerful creative tools but risk amplifying harmful social biases. We frame representational societal bias assessment as an image curation and evaluation task and introduce a pilot benchmark of occupational portrayals spanning five socially salient roles (CEO, Nurse, Software Engineer, Teacher, Athlete). Using five state-of-the-art models: closed-source (DALLE 3, Gemini Imagen 4.0) and open-source (FLUX.1-dev, Stable Diffusion XL Turbo, Grok-2 Image), we compare neutral baseline prompts against fairness-aware controlled prompts designed to encourage demographic diversity. All outputs are annotated for gender (male, female) and race (Asian, Black, White), enabling structured distributional analysis. Results show that prompting can substantially shift demographic representations, but with highly model-specific effects: some systems diversify effectively, others overcorrect into unrealistic uniformity, and some show little responsiveness. These findings highlight both the promise and the limitations of prompting as a fairness intervention, underscoring the need for complementary model-level strategies. We release all code and data for transparency and reproducibility https://github.com/maximus-powers/img-gen-bias-analysis.
中文摘要:文本到图像模型可能放大社会偏见,但通过提示策略可改变人口统计特征的呈现效果,不同模型响应差异显著,既展现了促进公平的潜力也揭示了其局限性。
English Summary: Text-to-Image models can amplify social biases, but prompting strategies can alter demographic portrayals with varying effectiveness across different models, highlighting both potential and limitations for fairness interventions.

Authors:Xueyang Kang, Zhengkang Xiang, Zezheng Zhang, Kourosh Khoshelham
Title: Look Beyond: Two-Stage Scene View Generation via Panorama and Video Diffusion
Abstract:
Novel view synthesis (NVS) from a single image is highly ill-posed due to large unobserved regions, especially for views that deviate significantly from the input. While existing methods focus on consistency between the source and generated views, they often fail to maintain coherence and correct view alignment across long-range or looped trajectories. We propose a model that addresses this by decomposing single-view NVS into a 360-degree scene extrapolation followed by novel view interpolation. This design ensures long-term view and scene consistency by conditioning on keyframes extracted and warped from a generated panoramic representation. In the first stage, a panorama diffusion model learns the scene prior from the input perspective image. Perspective keyframes are then sampled and warped from the panorama and used as anchor frames in a pre-trained video diffusion model, which generates novel views through a proposed spatial noise diffusion process. Compared to prior work, our method produces globally consistent novel views -- even in loop closure scenarios -- while enabling flexible camera control. Experiments on diverse scene datasets demonstrate that our approach outperforms existing methods in generating coherent views along user-defined trajectories. Our implementation is available at https://github.com/YiGuYT/LookBeyond.
中文摘要:该模型通过先外推360度场景再插值新视角的方法,解决了单图像新视角合成中的长期一致性问题,在用户定义轨迹上比现有方法生成更连贯的视图。
English Summary: The proposed model enhances novel view synthesis from a single image by first extrapolating a 360-degree scene and then interpolating novel views, ensuring long-term consistency and superior performance on user-defined trajectories compared to existing methods.

Authors:Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, Lei Zhu
Title: SegDINO: An Efficient Design for Medical and Natural Image Segmentation with DINO-V3
Abstract:
The DINO family of self-supervised vision models has shown remarkable transferability, yet effectively adapting their representations for segmentation remains challenging. Existing approaches often rely on heavy decoders with multi-scale fusion or complex upsampling, which introduce substantial parameter overhead and computational cost. In this work, we propose SegDINO, an efficient segmentation framework that couples a frozen DINOv3 backbone with a lightweight decoder. SegDINO extracts multi-level features from the pretrained encoder, aligns them to a common resolution and channel width, and utilizes a lightweight MLP head to directly predict segmentation masks. This design minimizes trainable parameters while preserving the representational power of foundation features. Extensive experiments across six benchmarks, including three medical datasets (TN3K, Kvasir-SEG, ISIC) and three natural image datasets (MSD, VMD-D, ViSha), demonstrate that SegDINO consistently achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/script-Yang/SegDINO.
中文摘要:SegDINO是一种高效的分割框架,通过将冻结的DINOv3主干网络与轻量级解码器相结合,在多个基准测试中实现最优性能,同时显著降低计算成本。
English Summary: SegDINO is an efficient segmentation framework combining a frozen DINOv3 backbone with a lightweight decoder that achieves state-of-the-art performance across multiple benchmarks while minimizing computational overhead.

Authors:Xinlei Liu, Tao Hu, Peng Yi, Weitao Han, Jichao Xie, Baolin Li
Title: Sequential Difference Maximization: Generating Adversarial Examples via Multi-Stage Optimization
Abstract:
Efficient adversarial attack methods are critical for assessing the robustness of computer vision models. In this paper, we reconstruct the optimization objective for generating adversarial examples as "maximizing the difference between the non-true labels' probability upper bound and the true label's probability," and propose a gradient-based attack method termed Sequential Difference Maximization (SDM). SDM establishes a three-layer optimization framework of "cycle-stage-step." The processes between cycles and between iterative steps are respectively identical, while optimization stages differ in terms of loss functions: in the initial stage, the negative probability of the true label is used as the loss function to compress the solution space; in subsequent stages, we introduce the Directional Probability Difference Ratio (DPDR) loss function to gradually increase the non-true labels' probability upper bound by compressing the irrelevant labels' probabilities. Experiments demonstrate that compared with previous SOTA methods, SDM not only exhibits stronger attack performance but also achieves higher attack cost-effectiveness. Additionally, SDM can be combined with adversarial training methods to enhance their defensive effects. The code is available at https://github.com/X-L-Liu/SDM.
Chinese: 本文提出序列差异最大化(SDM)方法,通过循环-阶段-步骤的三层优化框架,在压缩真实标签概率的同时提升非真实标签概率上限,相比现有最优方法不仅攻击性能更强且成本效益更高。
English: This paper introduces Sequential Difference Maximization (SDM), a gradient-based adversarial attack method that enhances both attack effectiveness and cost-efficiency by optimizing non-true label probabilities while compressing the true label's probability, outperforming previous state-of-the-art methods.

Authors:Zhenhua Xu, Zhaokun Yan, Binhan Xu, Xin Tong, Haitao Xu, Yourong Chen, Meng Han
Title: Unlocking the Effectiveness of LoRA-FP for Seamless Transfer Implantation of Fingerprints in Downstream Models
Abstract:
With the rapid advancement of large language models (LLMs), safeguarding intellectual property (IP) has become increasingly critical. To address the challenges of high costs and potential contamination in fingerprint integration, we propose LoRA-FP, a lightweight, plug-and-play framework that embeds backdoor fingerprints into LoRA adapters through constrained fine-tuning. This design enables seamless fingerprint transplantation via parameter fusion, eliminating the need for full-parameter updates while preserving model integrity. Experimental results demonstrate that LoRA-FP not only significantly reduces computational overhead compared to conventional approaches but also achieves superior robustness across diverse scenarios, including incremental training and model fusion. Our code and datasets are publicly available at https://github.com/Xuzhenhua55/LoRA-FP.
中文: 针对大语言模型知识产权保护的挑战,LoRA-FP提出了一种轻量级框架,通过约束微调将指纹嵌入LoRA适配器,在显著降低计算成本的同时,保持了多种场景下的鲁棒性。
English: To address intellectual property protection challenges in large language models, LoRA-FP introduces a lightweight framework that embeds fingerprints into LoRA adapters through constrained fine-tuning, significantly reducing computational costs while maintaining robustness across various scenarios.

Authors:Zeyu Li, Annan Shu
Title: Aligned Anchor Groups Guided Line Segment Detector
Abstract:
This paper introduces a novel line segment detector, the Aligned Anchor Groups guided Line Segment Detector (AAGLSD), designed to detect line segments from images with high precision and completeness. The algorithm employs a hierarchical approach to extract candidate pixels with different saliency levels, including regular anchors and aligned anchor groups. AAGLSD initiates from these aligned anchor groups, sequentially linking anchors and updating the currently predicted line segment simultaneously. The final predictions are derived through straightforward validation and merging of adjacent line segments, avoiding complex refinement strategies. AAGLSD is evaluated on various datasets and quantitative experiments demonstrate that the proposed method can effectively extract complete line segments from input images compared to other advanced line segment detectors. The implementation is available at https://github.com/LLiDaBao/AAGLSD.
中文摘要:AAGLSD是一种新型线段检测器,通过对齐锚点组和分层像素提取技术,结合简单验证与合并流程,实现了高精度且完整的线段检测效果,性能优于现有先进方法。
English Summary: AAGLSD is a novel line segment detector that uses aligned anchor groups and hierarchical pixel extraction to achieve high-precision detection with simple validation and merging processes, outperforming existing methods in completeness.

Authors:Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
Title: EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
Abstract:
Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
中文摘要:我们提出的多阶段检索框架融合了密集文档检索、事件感知重排序与多模态匹配技术,通过结合语言推理与视觉检索能力,在EVENTA 2025挑战赛中实现了基于复杂事件描述的图像检索最佳性能。
English Summary: Our multi-stage retrieval framework, integrating dense article retrieval, event-aware reranking, and multimodal matching, achieves top performance in the EVENTA 2025 Challenge by effectively combining language reasoning with visual retrieval for complex event-based image search.

Authors:Amartya Banerjee, Somnath Kar, Anirban Pal, Debabrata Maiti
Title: Valid Property-Enhanced Contrastive Learning for Targeted Optimization & Resampling for Novel Drug Design
Abstract:
Efficiently steering generative models toward pharmacologically relevant regions of chemical space remains a major obstacle in molecular drug discovery under low-data regimes. We present VECTOR+: Valid-property-Enhanced Contrastive Learning for Targeted Optimization and Resampling, a framework that couples property-guided representation learning with controllable molecule generation. VECTOR+ applies to both regression and classification tasks and enables interpretable, data-efficient exploration of functional chemical space. We evaluate on two datasets: a curated PD-L1 inhibitor set (296 compounds with experimental $IC_{50}$ values) and a receptor kinase inhibitor set (2,056 molecules by binding mode). Despite limited training data, VECTOR+ generates novel, synthetically tractable candidates. Against PD-L1 (PDB 5J89), 100 of 8,374 generated molecules surpass a docking threshold of $-15.0$ kcal/mol, with the best scoring $-17.6$ kcal/mol compared to the top reference inhibitor ($-15.4$ kcal/mol). The best-performing molecules retain the conserved biphenyl pharmacophore while introducing novel motifs. Molecular dynamics (250 ns) confirm binding stability (ligand RMSD < $2.5$ angstroms). VECTOR+ generalizes to kinase inhibitors, producing compounds with stronger docking scores than established drugs such as brigatinib and sorafenib. Benchmarking against JT-VAE and MolGPT across docking, novelty, uniqueness, and Tanimoto similarity highlights the superior performance of our method. These results position our work as a robust, extensible approach for property-conditioned molecular design in low-data settings, bridging contrastive learning and generative modeling for reproducible, AI-accelerated discovery.
中文: VECTOR+是一种创新框架,通过结合对比学习与生成模型,在低数据条件下高效设计具有药理相关性的分子,相比现有方法能生成更稳定、新颖且具有更强对接活性的化合物,展现出卓越性能。
English: VECTOR+ is a novel framework that integrates contrastive learning with generative modeling to efficiently design pharmacologically relevant molecules in low-data scenarios, demonstrating superior performance in generating stable and novel compounds with enhanced docking scores compared to existing methods.

Authors:Tung Nguyen, Harkanwar Singh, Nilay Naharas, Lucas Bandarkar, Aditya Grover
Title: IndiaWeatherBench: A Dataset and Benchmark for Data-Driven Regional Weather Forecasting over India
Abstract:
Regional weather forecasting is a critical problem for localized climate adaptation, disaster mitigation, and sustainable development. While machine learning has shown impressive progress in global weather forecasting, regional forecasting remains comparatively underexplored. Existing efforts often use different datasets and experimental setups, limiting fair comparison and reproducibility. We introduce IndiaWeatherBench, a comprehensive benchmark for data-driven regional weather forecasting focused on the Indian subcontinent. IndiaWeatherBench provides a curated dataset built from high-resolution regional reanalysis products, along with a suite of deterministic and probabilistic metrics to facilitate consistent training and evaluation. To establish strong baselines, we implement and evaluate a range of models across diverse architectures, including UNets, Transformers, and Graph-based networks, as well as different boundary conditioning strategies and training objectives. While focused on India, IndiaWeatherBench is easily extensible to other geographic regions. We open-source all raw and preprocessed datasets, model implementations, and evaluation pipelines to promote accessibility and future development. We hope IndiaWeatherBench will serve as a foundation for advancing regional weather forecasting research. Code is available at https://github.com/tung-nd/IndiaWeatherBench.
Chinese: 印度气象基准(IndiaWeatherBench)被提出作为一个全面的数据驱动区域天气预报基准,为印度次大陆提供精选数据集、评估指标和基线模型,以推动这一相对未充分探索领域的研究进展。
English: IndiaWeatherBench is introduced as a comprehensive benchmark for data-driven regional weather forecasting in India, providing curated datasets, evaluation metrics, and baseline models to advance research in this underexplored area.

Authors:Md Tanzib Hosain, Md Kishor Morol
Title: Can Multi-turn Self-refined Single Agent LMs with Retrieval Solve Hard Coding Problems?
Abstract:
Among the hardest tasks for humans are those found in competitive programming where problems require sophisticated algorithmic thinking, puzzle solving, and the creation of effective code. As a domain to assess language models (LMs), it has not received enough attention, though. This study presents the ICPC benchmark, which consists of 254 international collegiate programming contest (ICPC) tasks. Each problem includes official analysis, reference code, and sample, high-quality unit, and hidden tests. We are able to develop and evaluate a variety of LM inference techniques for competitive programming with these resources. With zero-shot chain-of-thought prompting, we find that o1 only achieves a 19.1\% pass@1 solve rate. With our best inference technique, which combines multi-turn self-judge with reflection and retrieval over episodic information, raises this to 42.2\%. Furthermore, we conduct a new human-in-the-loop investigation to gain a deeper understanding of the remaining difficulties. Surprisingly, we discover that o1 can solve 17 out of 18 problems that were previously unsolvable by any model or technique with just a few specific instructions. A footstep toward LMs with grounded, imaginative, and algorithmic thinking is provided by our quantitative findings and qualitative research. We open-source our code and data at https://github.com/kraritt/zolve.
中文: 本研究提出ICPC基准以评估语言模型在编程竞赛中的表现,发现高级推理技术能显著提升解题率,并揭示针对性指导可帮助模型突破先前无法解决的难题。
English: This study introduces the ICPC benchmark for evaluating language models in competitive programming, showing that advanced inference techniques significantly improve solve rates and revealing that targeted guidance enables models to overcome previously unsolvable problems.

Authors:Maggie Chen, Hala Lambdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini
Title: Towards Methane Detection Onboard Satellites
Abstract:
Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using \textit{unorthorectified} data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS , along with code at https://github.com/spaceml-org/plume-hunter.
中文摘要:一种利用未正射校正卫星数据的新型机器学习方法,在减少预处理需求的同时,实现了与传统正射校正方法相当的甲烷检测性能。
English summary: A new machine learning approach using unorthorectified satellite data achieves methane detection performance comparable to traditional orthorectified methods while reducing preprocessing requirements.

Authors:Maggie Chen, Hala Lambdouar, Luca Marini, Laura Martínez-Ferrer, Chris Bridges, Giacomo Acciarini
Title: Towards Methane Detection Onboard Satellites
Abstract:
Methane is a potent greenhouse gas and a major driver of climate change, making its timely detection critical for effective mitigation. Machine learning (ML) deployed onboard satellites can enable rapid detection while reducing downlink costs, supporting faster response systems. Conventional methane detection methods often rely on image processing techniques, such as orthorectification to correct geometric distortions and matched filters to enhance plume signals. We introduce a novel approach that bypasses these preprocessing steps by using \textit{unorthorectified} data (UnorthoDOS). We find that ML models trained on this dataset achieve performance comparable to those trained on orthorectified data. Moreover, we also train models on an orthorectified dataset, showing that they can outperform the matched filter baseline (mag1c). We release model checkpoints and two ML-ready datasets comprising orthorectified and unorthorectified hyperspectral images from the Earth Surface Mineral Dust Source Investigation (EMIT) sensor at https://huggingface.co/datasets/SpaceML/UnorthoDOS , along with code at https://github.com/spaceml-org/plume-hunter.
中文摘要:一种利用未正射校正卫星数据的新型机器学习方法,在减少预处理需求的同时,实现了与传统正射校正方法相当的甲烷检测性能。
English summary: A new machine learning approach using unorthorectified satellite data achieves methane detection performance comparable to traditional orthorectified methods while reducing preprocessing requirements.

Authors:Shiqiao Zhou, Holger Schöner, Huanbo Lyu, Edouard Fouché, Shuo Wang
Title: BALM-TSF: Balanced Multimodal Alignment for LLM-Based Time Series Forecasting
Abstract:
Time series forecasting is a long-standing and highly challenging research topic. Recently, driven by the rise of large language models (LLMs), research has increasingly shifted from purely time series methods toward harnessing textual modalities to enhance forecasting performance. However, the vast discrepancy between text and temporal data often leads current multimodal architectures to over-emphasise one modality while neglecting the other, resulting in information loss that harms forecasting performance. To address this modality imbalance, we introduce BALM-TSF (Balanced Multimodal Alignment for LLM-Based Time Series Forecasting), a lightweight time series forecasting framework that maintains balance between the two modalities. Specifically, raw time series are processed by the time series encoder, while descriptive statistics of raw time series are fed to an LLM with learnable prompt, producing compact textual embeddings. To ensure balanced cross-modal context alignment of time series and textual embeddings, a simple yet effective scaling strategy combined with a contrastive objective then maps these textual embeddings into the latent space of the time series embeddings. Finally, the aligned textual semantic embeddings and time series embeddings are together integrated for forecasting. Extensive experiments on standard benchmarks show that, with minimal trainable parameters, BALM-TSF achieves state-of-the-art performance in both long-term and few-shot forecasting, confirming its ability to harness complementary information from text and time series. Code is available at https://github.com/ShiqiaoZhou/BALM-TSF.
中文:BALM-TSF是一种轻量级多模态框架,通过对比对齐平衡时间序列与文本嵌入,以极少的参数实现了顶尖的预测性能。
English: BALM-TSF is a lightweight multimodal framework that balances time series and text embeddings through contrastive alignment, achieving state-of-the-art forecasting performance with minimal parameters.

Authors:Osama Abu Hamdan, Hao Che, Engin Arslan, Md Arifuzzaman
Title: FLEET: A Federated Learning Emulation and Evaluation Testbed for Holistic Research
Abstract:
Federated Learning (FL) presents a robust paradigm for privacy-preserving, decentralized machine learning. However, a significant gap persists between the theoretical design of FL algorithms and their practical performance, largely because existing evaluation tools often fail to model realistic operational conditions. Many testbeds oversimplify the critical dynamics among algorithmic efficiency, client-level heterogeneity, and continuously evolving network infrastructure. To address this challenge, we introduce the Federated Learning Emulation and Evaluation Testbed (FLEET). This comprehensive platform provides a scalable and configurable environment by integrating a versatile, framework-agnostic learning component with a high-fidelity network emulator. FLEET supports diverse machine learning frameworks, customizable real-world network topologies, and dynamic background traffic generation. The testbed collects holistic metrics that correlate algorithmic outcomes with detailed network statistics. By unifying the entire experiment configuration, FLEET enables researchers to systematically investigate how network constraints, such as limited bandwidth, high latency, and packet loss, affect the convergence and efficiency of FL algorithms. This work provides the research community with a robust tool to bridge the gap between algorithmic theory and real-world network conditions, promoting the holistic and reproducible evaluation of federated learning systems.
Chinese Summary: 联邦学习仿真与评估测试平台(FLEET)通过集成框架无关的学习组件和高保真网络模拟器,提供了一个可扩展的配置环境,能够系统研究网络限制对联邦学习算法收敛和效率的影响,从而弥补理论设计与实际性能之间的差距。
English Summary: Federated Learning Emulation and Evaluation Testbed (FLEET) is introduced as a comprehensive platform that bridges the gap between theoretical FL algorithms and practical performance by integrating a framework-agnostic learning component with a high-fidelity network emulator, enabling systematic investigation of network constraints on FL efficiency.

Authors:Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Klaus H. Maier-Hein
Title: Promptable Longitudinal Lesion Segmentation in Whole-Body CT
Abstract:
Accurate segmentation of lesions in longitudinal whole-body CT is essential for monitoring disease progression and treatment response. While automated methods benefit from incorporating longitudinal information, they remain limited in their ability to consistently track individual lesions across time. Task 2 of the autoPET/CT IV Challenge addresses this by providing lesion localizations and baseline delineations, framing the problem as longitudinal promptable segmentation. In this work, we extend the recently proposed LongiSeg framework with promptable capabilities, enabling lesion-specific tracking through point and mask interactions. To address the limited size of the provided training set, we leverage large-scale pretraining on a synthetic longitudinal CT dataset. Our experiments show that pretraining substantially improves the ability to exploit longitudinal context, yielding an improvement of up to 6 Dice points compared to models trained from scratch. These findings demonstrate the effectiveness of combining longitudinal context with interactive prompting for robust lesion tracking. Code is publicly available at https://github.com/MIC-DKFZ/LongiSeg/tree/autoPET.
中文摘要:本研究通过引入可提示分割功能扩展了LongiSeg框架,利用合成数据的大规模预训练,在纵向CT扫描中实现病灶精准追踪,将模型性能提升最高达6个Dice值点。
English Summary: The study enhances the LongiSeg framework with promptable segmentation for tracking individual lesions in longitudinal CT scans, using large-scale pretraining on synthetic data to improve performance by up to 6 Dice points.

Authors:Osama Abu Hamdan, Hao Che, Engin Arslan, Md Arifuzzaman
Title: SmartFLow: A Communication-Efficient SDN Framework for Cross-Silo Federated Learning
Abstract:
Cross-silo Federated Learning (FL) enables multiple institutions to collaboratively train machine learning models while preserving data privacy. In such settings, clients repeatedly exchange model weights with a central server, making the overall training time highly sensitive to network performance. However, conventional routing methods often fail to prevent congestion, leading to increased communication latency and prolonged training. Software-Defined Networking (SDN), which provides centralized and programmable control over network resources, offers a promising way to address this limitation. To this end, we propose SmartFLow, an SDN-based framework designed to enhance communication efficiency in cross-silo FL. SmartFLow dynamically adjusts routing paths in response to changing network conditions, thereby reducing congestion and improving synchronization efficiency. Experimental results show that SmartFLow decreases parameter synchronization time by up to 47% compared to shortest-path routing and 41% compared to capacity-aware routing. Furthermore, it achieves these gains with minimal computational overhead and scales effectively to networks of up to 50 clients, demonstrating its practicality for real-world FL deployments.
中文: SmartFLow是一种基于SDN的框架,通过动态调整路由路径减少拥塞,将参数同步时间最多缩短47%,以最小开销显著提升跨机构联邦学习的通信效率。
English: SmartFLow, an SDN-based framework, dynamically optimizes routing paths to reduce congestion and cut parameter synchronization time by up to 47%, enhancing communication efficiency in cross-silo federated learning with minimal overhead.

Authors:Saumya Chaturvedi, Aman Chadha, Laurent Bindschaedler
Title: SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction
Abstract:
Converting natural language queries into SQL queries is a crucial challenge in both industry and academia, aiming to increase access to databases and large-scale applications. This work examines how in-context learning and chain-of-thought can be utilized to develop a robust solution for text-to-SQL systems. We propose SQL-of-Thought: a multi-agent framework that decomposes the Text2SQL task into schema linking, subproblem identification, query plan generation, SQL generation, and a guided correction loop. Unlike prior systems that rely only on execution-based static correction, we introduce taxonomy-guided dynamic error modification informed by in-context learning. SQL-of-Thought achieves state-of-the-art results on the Spider dataset and its variants, combining guided error taxonomy with reasoning-based query planning.
中文: 本文提出SQL-of-Thought多智能体框架,通过任务分解和动态纠错机制改进文本到SQL的转换,在Spider数据集上取得了最优性能。
English: This paper introduces SQL-of-Thought, a multi-agent framework that enhances text-to-SQL conversion by decomposing tasks and incorporating dynamic error correction, achieving state-of-the-art performance on the Spider dataset.

Authors:Filip J. Kucia, Bartosz Grabek, Szymon D. Trochimiak, Anna Wróblewska
Title: How to Make Museums More Interactive? Case Study of Artistic Chatbot
Abstract:
Conversational agents powered by Large Language Models (LLMs) are increasingly utilized in educational settings, in particular in individual closed digital environments, yet their potential adoption in the physical learning environments like cultural heritage sites, museums, and art galleries remains relatively unexplored. In this study, we present Artistic Chatbot, a voice-to-voice RAG-powered chat system to support informal learning and enhance visitor engagement during a live art exhibition celebrating the 15th anniversary of the Faculty of Media Art at the Warsaw Academy of Fine Arts, Poland. The question answering (QA) chatbot responded to free-form spoken questions in Polish using the context retrieved from a curated, domain-specific knowledge base consisting of 226 documents provided by the organizers, including faculty information, art magazines, books, and journals. We describe the key aspects of the system architecture and user interaction design, as well as discuss the practical challenges associated with deploying chatbots at public cultural sites. Our findings, based on interaction analysis, demonstrate that chatbots such as Artistic Chatbot effectively maintain responses grounded in exhibition content (60\% of responses directly relevant), even when faced with unpredictable queries outside the target domain, showing their potential for increasing interactivity in public cultural sites. GitHub project page: https://github.com/cinekucia/artistic-chatbot-cikm2025
中文: 本研究开发的Artistic Chatbot语音问答系统通过检索增强生成技术,在艺术展览中为访客提供基于专业知识的回答,有效提升了文化场所的互动体验,展现了在实体学习环境中应用的可行性。
English: This study introduces Artistic Chatbot, a voice-based RAG system that effectively supports informal learning at cultural sites by providing contextually grounded responses, demonstrating its potential to enhance visitor engagement in physical settings like art exhibitions.

Authors:Peirong Liu, Oula Puonti, Xiaoling Hu, Karthik Gopinath, Annabel Sorby-Adams, Daniel C. Alexander, W. Taylor Kimberly, Juan E. Iglesias
Title: A Modality-agnostic Multi-task Foundation Model for Human Brain Imaging
Abstract:
Recent learning-based approaches have made astonishing advances in calibrated medical imaging like computerized tomography (CT), yet they struggle to generalize in uncalibrated modalities -- notably magnetic resonance (MR) imaging, where performance is highly sensitive to the differences in MR contrast, resolution, and orientation. This prevents broad applicability to diverse real-world clinical protocols. Here we introduce BrainFM, a modality-agnostic, multi-task vision foundation model for human brain imaging. With the proposed "mild-to-severe" intra-subject generation and "real-synth" mix-up training strategy, BrainFM is resilient to the appearance of acquired images (e.g., modality, contrast, deformation, resolution, artifacts), and can be directly applied to five fundamental brain imaging tasks, including image synthesis for CT and T1w/T2w/FLAIR MRI, anatomy segmentation, scalp-to-cortical distance, bias field estimation, and registration. We evaluate the efficacy of BrainFM on eleven public datasets, and demonstrate its robustness and effectiveness across all tasks and input modalities. Code is available at https://github.com/jhuldr/BrainFM.
中文:BrainFM是一种适用于人脑成像的多任务视觉基础模型,通过创新的训练策略解决了未校准模态的泛化难题,能在多种临床数据上稳定执行五项核心成像任务。
English: BrainFM is a versatile vision foundation model for brain imaging that overcomes generalization challenges in uncalibrated modalities like MRI through innovative training strategies, enabling robust performance across five key imaging tasks on diverse clinical data.

Authors:Yasser Benigmim, Subhankar Roy, Khalid Oublal, Imad Eddine Marouf, Slim Essid, Vicky Kalogeiton, Stéphane Lathuilière
Title: Make me an Expert: Distilling from Generalist Black-Box Models into Specialized Models for Semantic Segmentation
Abstract:
The rise of Artificial Intelligence as a Service (AIaaS) democratizes access to pre-trained models via Application Programming Interfaces (APIs), but also raises a fundamental question: how can local models be effectively trained using black-box models that do not expose their weights, training data, or logits, a constraint in which current domain adaptation paradigms are impractical ? To address this challenge, we introduce the Black-Box Distillation (B2D) setting, which enables local model adaptation under realistic constraints: (1) the API model is open-vocabulary and trained on large-scale general-purpose data, and (2) access is limited to one-hot predictions only. We identify that open-vocabulary models exhibit significant sensitivity to input resolution, with different object classes being segmented optimally at different scales, a limitation termed the "curse of resolution". Our method, ATtention-Guided sCaler (ATGC), addresses this challenge by leveraging DINOv2 attention maps to dynamically select optimal scales for black-box model inference. ATGC scores the attention maps with entropy to identify informative scales for pseudo-labelling, enabling effective distillation. Experiments demonstrate substantial improvements under black-box supervision across multiple datasets while requiring only one-hot API predictions. Our code is available at https://github.com/yasserben/ATGC.
AIaaS通过API提供预训练模型,但在无法获取模型内部参数的情况下增加了本地模型训练的难度,因此提出了黑盒蒸馏(B2D)框架和ATGC方法,利用注意力引导的动态尺度选择优化伪标签生成,实现高效知识蒸馏。
AIaaS enables access to pre-trained models via APIs but complicates local model training without access to model internals, leading to the proposed Black-Box Distillation (B2D) setting and ATGC method that uses attention-guided scaling to optimize pseudo-labeling for effective distillation.

Authors:Saksorn Ruangtanusak, Pittawat Taveekitworachai, Kunat Pipatanakul
Title: Talk Less, Call Right: Enhancing Role-Play LLM Agents with Automatic Prompt Optimization and Role Prompting
Abstract:
This report investigates approaches for prompting a tool-augmented large language model (LLM) to act as a role-playing dialogue agent in the API track of the Commonsense Persona-grounded Dialogue Challenge (CPDC) 2025. In this setting, dialogue agents often produce overly long in-character responses (over-speaking) while failing to use tools effectively according to the persona (under-acting), such as generating function calls that do not exist or making unnecessary tool calls before answering. We explore four prompting approaches to address these issues: 1) basic role prompting, 2) human-crafted role prompting, 3) automatic prompt optimization (APO), and 4) rule-based role prompting. The rule-based role prompting (RRP) approach achieved the best performance through two novel techniques--character-card/scene-contract design and strict enforcement of function calling--which led to an overall score of 0.571, improving on the zero-shot baseline score of 0.519. These findings demonstrate that RRP design can substantially improve the effectiveness and reliability of role-playing dialogue agents compared with more elaborate methods such as APO. To support future efforts in developing persona prompts, we are open-sourcing all of our best-performing prompts and the APO tool. Source code is available at https://github.com/scb-10x/apo.
中文: 本研究探索了四种提示方法,通过角色卡片设计和严格函数调用优化角色扮演对话代理的过度发言和行动不足问题,其中基于规则的提示方法表现最佳。
English: This study explores four prompting methods to enhance role-playing dialogue agents by addressing over-speaking and under-acting issues, with rule-based role prompting achieving the best performance through character-card design and strict function enforcement.

Authors:Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yue, Min Liu, Yaonan Wang, Hang Zhang
Title: Encoder-Only Image Registration
Abstract:
Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR's effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR is publicly available on https://github.com/XiangChen1994/EOIR.
Chinese: 提出的仅编码器图像配准(EOIR)框架通过轻量级卷积网络将特征学习与流估计分离,在可变形图像配准中实现了精度与效率、精度与平滑度的更优平衡。
English: The proposed Encoder-Only Image Registration (EOIR) framework separates feature learning from flow estimation using a lightweight convolutional network to achieve superior accuracy-efficiency and accuracy-smoothness trade-offs in deformable image registration.

Authors:Xuechao Zou, Shun Zhang, Xing Fu, Yue Li, Kai Li, Yushe Cao, Congyan Lang, Pin Tao, Junliang Xing
Title: Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
Abstract:
Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability. Project page is available at https://github.com/XavierJiezou/Face-MoGLE.
中文: Face-MoGLE提出了一种新颖的扩散变换器框架,通过语义解耦的潜在建模和全局-局部专家混合机制,实现了对人脸生成的精细控制,在高质量输出和强泛化能力方面表现卓越。
English: Face-MoGLE introduces a novel diffusion transformer framework that enables precise, fine-grained control over face generation through semantic-decoupled latent modeling and a mixture of global-local experts, achieving high-quality results with robust generalization.

Authors:Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang
Title: Metis: Training Large Language Models with Advanced Low-Bit Quantization
Abstract:
This work identifies anisotropic parameter distributions as a fundamental barrier to training large language models (LLMs) with low-bit quantization: a few dominant singular values create wide numerical ranges that conflict with the inherent bias of block-wise quantization. This bias disproportionately preserves high-magnitude values while discarding smaller ones, causing training instability and low model performance. This work introduces Metis, a training framework that combines (i) spectral decomposition with random embedding to efficiently disentangle dominant from long-tail components, compressing broad distributions into quantization-friendly narrow ranges; (ii) adaptive learning rates in the spectral domain to amplify underrepresented directions and better capture diverse features critical for performance; and (iii) a dual-range regularizer that jointly constrains numerical precision and parameter range distribution, ensuring stable, unbiased low-bit training. With Metis, FP8 training surpasses FP32 baselines, and FP4 training achieves accuracy comparable to FP32, paving the way for robust and scalable LLM training under advanced low-bit quantization. The code implementation for Metis is available at: https://github.com/sii-research/Metis.
中文摘要:本研究识别出参数分布的各向异性是低比特量化训练的主要障碍,并提出Metis框架,通过谱分解、自适应学习率和双范围正则化技术,使FP4/FP8训练在保持稳定性的同时达到FP32基准性能。
English Summary: This study identifies anisotropic parameter distributions as a key obstacle in low-bit LLM quantization and introduces Metis, a training framework that employs spectral decomposition, adaptive learning rates, and dual-range regularization to enable stable FP4/FP8 training matching FP32 performance.

Authors:Minku Kang, Hogun Park
Title: Curriculum Guided Personalized Subgraph Federated Learning
Abstract:
Subgraph Federated Learning (FL) aims to train Graph Neural Networks (GNNs) across distributed private subgraphs, but it suffers from severe data heterogeneity. To mitigate data heterogeneity, weighted model aggregation personalizes each local GNN by assigning larger weights to parameters from clients with similar subgraph characteristics inferred from their current model states. However, the sparse and biased subgraphs often trigger rapid overfitting, causing the estimated client similarity matrix to stagnate or even collapse. As a result, aggregation loses effectiveness as clients reinforce their own biases instead of exploiting diverse knowledge otherwise available. To this end, we propose a novel personalized subgraph FL framework called Curriculum guided personalized sUbgraph Federated Learning (CUFL). On the client side, CUFL adopts Curriculum Learning (CL) that adaptively selects edges for training according to their reconstruction scores, exposing each GNN first to easier, generic cross-client substructures and only later to harder, client-specific ones. This paced exposure prevents early overfitting to biased patterns and enables gradual personalization. By regulating personalization, the curriculum also reshapes server aggregation from exchanging generic knowledge to propagating client-specific knowledge. Further, CUFL improves weighted aggregation by estimating client similarity using fine-grained structural indicators reconstructed on a random reference graph. Extensive experiments on six benchmark datasets confirm that CUFL achieves superior performance compared to relevant baselines. Code is available at https://github.com/Kang-Min-Ku/CUFL.git.
中文摘要:CUFL提出了一种课程引导的个性化子图联邦学习框架,通过逐步让模型接触通用及客户端特定图结构来防止早期过拟合,并利用细粒度结构指标改进客户端相似性估计。
English Summary: CUFL introduces a curriculum-guided personalized federated learning framework that prevents early overfitting by progressively exposing models to generic then client-specific graph structures, while improving client similarity estimation through fine-grained structural indicators.

Authors:Shumpei Takezaki, Ryoma Bise, Shinnosuke Matsuo
Title: NoiseCutMix: A Novel Data Augmentation Approach by Mixing Estimated Noise in Diffusion Models
Abstract:
In this study, we propose a novel data augmentation method that introduces the concept of CutMix into the generation process of diffusion models, thereby exploiting both the ability of diffusion models to generate natural and high-resolution images and the characteristic of CutMix, which combines features from two classes to create diverse augmented data. Representative data augmentation methods for combining images from multiple classes include CutMix and MixUp. However, techniques like CutMix often result in unnatural boundaries between the two images due to contextual differences. Therefore, in this study, we propose a method, called NoiseCutMix, to achieve natural, high-resolution image generation featuring the fused characteristics of two classes by partially combining the estimated noise corresponding to two different classes in a diffusion model. In the classification experiments, we verified the effectiveness of the proposed method by comparing it with conventional data augmentation techniques that combine multiple classes, random image generation using Stable Diffusion, and combinations of these methods. Our codes are available at: https://github.com/shumpei-takezaki/NoiseCutMix
本研究提出NoiseCutMix新方法,将CutMix概念融入扩散模型,通过融合两类噪声生成自然高清的双类特征图像,在分类实验中验证了其优于传统数据增强技术的有效性。
This study introduces NoiseCutMix, a novel data augmentation technique that integrates CutMix into diffusion models to generate natural, high-resolution images by blending noise from two classes, enhancing classification performance over traditional methods.

Authors:Ziyi Xia, Kun Luo, Hongjin Qian, Zheng Liu
Title: Open Data Synthesis For Deep Research
Abstract:
Large language models (LLMs) are increasingly expected to go beyond simple factual queries toward Deep Research-tasks that require decomposing questions into sub-problems, coordinating multi-step reasoning, and synthesizing evidence from diverse sources. We formalize Deep Research tasks with verifiable answers as Hierarchical Constraint Satisfaction Problems (HCSPs), which are fundamentally different from single-constraint, multi-hop, or flat CSP formulations. However, existing benchmarks (e.g., Natural Questions, HotpotQA) fail to capture this complexity, while recent synthetic datasets often introduce shortcut reasoning, knowledge leakage, or lack sufficient structural depth. To address this gap, we introduce InfoSeek, a scalable framework for synthesizing complex Deep Research tasks. InfoSeek uses a dual-agent system to recursively build a Research Tree from large-scale webpages, blurring intermediate nodes into valid sub-problems, and converting these trees into natural language questions that require traversing the full hierarchy. It also enables rapid scaling, yielding over 50K training examples, a curated test set, and reasoning trajectories generated via reject sampling. Experiments show that models trained on InfoSeek consistently outperform strong baselines. On a challenging benchmark BrowseComp-Plus, 3B LLMs optimized with InfoSeek surpass much larger 32B models and lightweight commercial APIs (e.g., Gemini2.5-Flash), while achieving performance comparable to stronger APIs (e.g., Gemini2.5-Pro). By preserving meta-information such as intermediate steps and retrieval labels, InfoSeek further supports advanced optimization strategies, including compound reward design and trajectory-level exploration. We provide our codes and datasets in \href{https://github.com/VectorSpaceLab/InfoSeek}{this repository}.
中文: 该研究提出了InfoSeek框架,通过从网络数据生成层次化问题来合成复杂的深度研究任务,显著提升了大语言模型在多步推理和证据综合方面的性能。
English: The study introduces InfoSeek, a scalable framework for synthesizing complex Deep Research tasks by generating hierarchical questions from web data, which significantly enhances the performance of large language models in multi-step reasoning and evidence synthesis.

Authors:Zhenxin Li, Shuibing He, Jiahao Guo, Xuechen Zhang, Xian-He Sun, Gang Chen
Title: CRouting: Reducing Expensive Distance Calls in Graph-Based Approximate Nearest Neighbor Search
Abstract:
Approximate nearest neighbor search (ANNS) is a crucial problem in information retrieval and AI applications. Recently, there has been a surge of interest in graph-based ANNS algorithms due to their superior efficiency and accuracy. However, the repeated computation of distances in high-dimensional spaces constitutes the primary time cost of graph-based methods. To accelerate the search, we propose a novel routing strategy named CRouting, which bypasses unnecessary distance computations by exploiting the angle distributions of high-dimensional vectors. CRouting is designed as a plugin to optimize existing graph-based search with minimal code modifications. Our experiments show that CRouting reduces the number of distance computations by up to 41.5% and boosts queries per second by up to 1.48$\times$ on two predominant graph indexes, HNSW and NSG. Code is publicly available at https://github.com/ISCS-ZJU/CRouting.
Chinese: 提出的CRouting策略通过利用高维向量角度分布来规避不必要的距离计算,显著加速了基于图的近似最近邻搜索,在HNSW和NSG索引上实现了最高41.5%的计算量减少和1.48倍的查询速度提升。
English: The proposed CRouting strategy accelerates graph-based approximate nearest neighbor search by reducing unnecessary distance computations through angle distribution analysis, achieving up to 41.5% fewer calculations and 1.48× faster query speeds on HNSW and NSG indexes.

Authors:Zhen Chen, Xingjian Luo, Kun Yuan, Jinlin Wu, Danny T. M. Chan, Nassir Navab, Hongbin Liu, Zhen Lei, Jiebo Luo
Title: SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding
Abstract:
Surgical video understanding is crucial for facilitating Computer-Assisted Surgery (CAS) systems. Despite significant progress in existing studies, two major limitations persist, including inadequate visual content perception and insufficient temporal awareness in surgical videos, and hinder the development of versatile CAS solutions. In this work, we propose the SurgLLM framework, an effective large multimodal model tailored for versatile surgical video understanding tasks with enhanced spatial focus and temporal awareness. Specifically, to empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM, by performing instrument-centric Masked Video Reconstruction (MV-Recon) and subsequent multimodal alignment. To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings. Moreover, to accommodate various understanding tasks of surgical videos without conflicts, we devise a Surgical Task Dynamic Ensemble to efficiently triage a query with optimal learnable parameters in our SurgLLM. Extensive experiments performed on diverse surgical video understanding tasks, including captioning, general VQA, and temporal VQA, demonstrate significant improvements over the state-of-the-art approaches, validating the effectiveness of our SurgLLM in versatile surgical video understanding. The source code is available at https://github.com/franciszchen/SurgLLM.
中文: SurgLLM框架提出了一种大型多模态模型,通过创新的预训练和调优策略增强手术视频的空间聚焦和时间感知能力,在多种理解任务中实现了卓越性能。
English: The SurgLLM framework introduces a large multimodal model that enhances spatial focus and temporal awareness in surgical video understanding, achieving superior performance across various tasks through innovative pretraining and tuning strategies.

Authors:Xunpeng Yi, Yibing Zhang, Xinyu Xiang, Qinglong Yan, Han Xu, Jiayi Ma
Title: LUT-Fuse: Towards Extremely Fast Infrared and Visible Image Fusion via Distillation to Learnable Look-Up Tables
Abstract:
Current advanced research on infrared and visible image fusion primarily focuses on improving fusion performance, often neglecting the applicability on real-time fusion devices. In this paper, we propose a novel approach that towards extremely fast fusion via distillation to learnable lookup tables specifically designed for image fusion, termed as LUT-Fuse. Firstly, we develop a look-up table structure that utilizing low-order approximation encoding and high-level joint contextual scene encoding, which is well-suited for multi-modal fusion. Moreover, given the lack of ground truth in multi-modal image fusion, we naturally proposed the efficient LUT distillation strategy instead of traditional quantization LUT methods. By integrating the performance of the multi-modal fusion network (MM-Net) into the MM-LUT model, our method achieves significant breakthroughs in efficiency and performance. It typically requires less than one-tenth of the time compared to the current lightweight SOTA fusion algorithms, ensuring high operational speed across various scenarios, even in low-power mobile devices. Extensive experiments validate the superiority, reliability, and stability of our fusion approach. The code is available at https://github.com/zyb5/LUT-Fuse.
中文: 本文提出LUT-Fuse方法,通过可学习查找表与蒸馏策略实现极速红外与可见光图像融合,在保持高性能的同时,其速度比现有轻量级算法快十倍以上,适用于各类移动设备。
English: This paper introduces LUT-Fuse, a novel method that uses learnable lookup tables via distillation to achieve extremely fast infrared and visible image fusion, significantly outperforming current lightweight algorithms in speed while maintaining high performance across various devices.

Authors:Wei Ao, Vishnu Naresh Boddeti
Title: CryptoFace: End-to-End Encrypted Face Recognition
Abstract:
Face recognition is central to many authentication, security, and personalized applications. Yet, it suffers from significant privacy risks, particularly arising from unauthorized access to sensitive biometric data. This paper introduces CryptoFace, the first end-to-end encrypted face recognition system with fully homomorphic encryption (FHE). It enables secure processing of facial data across all stages of a face-recognition process--feature extraction, storage, and matching--without exposing raw images or features. We introduce a mixture of shallow patch convolutional networks to support higher-dimensional tensors via patch-based processing while reducing the multiplicative depth and, thus, inference latency. Parallel FHE evaluation of these networks ensures near-resolution-independent latency. On standard face recognition benchmarks, CryptoFace significantly accelerates inference and increases verification accuracy compared to the state-of-the-art FHE neural networks adapted for face recognition. CryptoFace will facilitate secure face recognition systems requiring robust and provable security. The code is available at https://github.com/human-analysis/CryptoFace.
中文:CryptoFace是首个采用全同态加密的端到端加密人脸识别系统,可在确保面部数据安全处理的同时,相比现有方法显著加速推理并提高验证准确率。
English: CryptoFace is the first end-to-end encrypted face recognition system using fully homomorphic encryption, enabling secure processing of facial data while accelerating inference and improving verification accuracy compared to existing methods.

Authors:Renat Sergazinov, Shao-An Yin
Title: Chunked TabPFN: Exact Training-Free In-Context Learning for Long-Context Tabular Data
Abstract:
TabPFN v2 achieves better results than tree-based models on several tabular benchmarks, which is notable since tree-based models are usually the strongest choice for tabular data. However, it cannot handle more than 10K context tokens because transformers have quadratic computation and memory costs. Unlike existing approaches that rely on context compression, such as selecting representative samples via K-nearest neighbors (KNN), we introduce a tiled-block strategy to compute attention within the TabPFN framework. This design is compatible with standard GPU setups and, to the best of our knowledge, is the first to enable TabPFN to process long contexts without any pre-processing. We demonstrate the effectiveness of our approach on the standard TabArena benchmark, with code available at https://github.com/mrsergazinov/chunk_tabpfn.
中文: TabPFN v2在多个表格数据基准测试中优于基于树的模型,但受限于Transformer的计算瓶颈,因此引入了分块策略,使其无需预处理即可处理长上下文,并在TabArena基准测试中验证了有效性。
English: TabPFN v2 surpasses tree-based models in tabular data benchmarks but is limited by transformers' computational constraints, prompting the introduction of a tiled-block strategy that enables handling long contexts without pre-processing and demonstrates effectiveness on the TabArena benchmark.

Authors:Ezra Erives, Bowen Jing, Peter Holderrieth, Tommi Jaakkola
Title: Continuously Tempered Diffusion Samplers
Abstract:
Annealing-based neural samplers seek to amortize sampling from unnormalized distributions by training neural networks to transport a family of densities interpolating from source to target. A crucial design choice in the training phase of such samplers is the proposal distribution by which locations are generated at which to evaluate the loss. Previous work has obtained such a proposal distribution by combining a partially learned transport with annealed Langevin dynamics. However, isolated modes and other pathological properties of the annealing path imply that such proposals achieve insufficient exploration and thereby lower performance post training. To remedy this, we propose continuously tempered diffusion samplers, which leverage exploration techniques developed in the context of molecular dynamics to improve proposal distributions. Specifically, a family of distributions across different temperatures is introduced to lower energy barriers at higher temperatures and drive exploration at the lower temperature of interest. We empirically validate improved sampler performance driven by extended exploration. Code is available at https://github.com/eje24/ctds.
中文: 退火神经采样器因提议分布的病理特性而面临探索不足的问题,连续调温扩散采样器通过引入多温度分布来增强探索,从而提升了采样性能。
English: Annealing-based neural samplers face exploration limitations due to pathological properties in their proposal distributions, which are addressed by continuously tempered diffusion samplers that introduce multi-temperature distributions to enhance exploration and improve performance.

Authors:Hikmat Khan, Syed Farhan Alam Zaidi, Pir Masoom Shah, Kiruthika Balakrishnan, Rabia Khan, Muhammad Waqas, Jia Wu
Title: MorphGen: Morphology-Guided Representation Learning for Robust Single-Domain Generalization in Histopathological Cancer Classification
Abstract:
Domain generalization in computational histopathology is hindered by heterogeneity in whole slide images (WSIs), caused by variations in tissue preparation, staining, and imaging conditions across institutions. Unlike machine learning systems, pathologists rely on domain-invariant morphological cues such as nuclear atypia (enlargement, irregular contours, hyperchromasia, chromatin texture, spatial disorganization), structural atypia (abnormal architecture and gland formation), and overall morphological atypia that remain diagnostic across diverse settings. Motivated by this, we hypothesize that explicitly modeling biologically robust nuclear morphology and spatial organization will enable the learning of cancer representations that are resilient to domain shifts. We propose MorphGen (Morphology-Guided Generalization), a method that integrates histopathology images, augmentations, and nuclear segmentation masks within a supervised contrastive learning framework. By aligning latent representations of images and nuclear masks, MorphGen prioritizes diagnostic features such as nuclear and morphological atypia and spatial organization over staining artifacts and domain-specific features. To further enhance out-of-distribution robustness, we incorporate stochastic weight averaging (SWA), steering optimization toward flatter minima. Attention map analyses revealed that MorphGen primarily relies on nuclear morphology, cellular composition, and spatial cell organization within tumors or normal regions for final classification. Finally, we demonstrate resilience of the learned representations to image corruptions (such as staining artifacts) and adversarial attacks, showcasing not only OOD generalization but also addressing critical vulnerabilities in current deep learning systems for digital pathology. Code, datasets, and trained models are available at: https://github.com/hikmatkhan/MorphGen
中文: MorphGen通过将核形态和空间组织整合到监督对比学习框架中,增强了组织病理学中的域泛化能力,提高了对域偏移和对抗攻击的鲁棒性。
English: MorphGen enhances domain generalization in histopathology by integrating nuclear morphology and spatial organization into a supervised contrastive learning framework, improving resilience to domain shifts and adversarial attacks.

Authors:Ghassen Baklouti, Maxime Zanella, Ismail Ben Ayed
Title: Language-Aware Information Maximization for Transductive Few-Shot CLIP
Abstract:
Transductive few-shot learning has triggered an abundant literature focusing on vision-only models, but is still at a nascent stage within the recent context of foundational vision-language models (VLMs). Only a few recent methods addressed the problem, pointing to the potential of tranduction in VLMs and to the need for VLM-tailored methods. Building on this momentum, we leverage information-theoretic concepts and recent progress in parameter-efficient fine-tuning (PEFT), developing a highly competitive transductive few-shot CLIP method. Specifically, we introduce a novel Language-aware Information MaximizatiOn (LIMO) loss integrating three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence penalizing deviation of the network's probabilistic outputs from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss based on the labeled shots. Furthermore, we challenge the commonly followed fine-tuning practices in the context of transductive few-shot learning, and explore PEFT strategies, completely overlooked in this context. Surprisingly, we observe substantial boosts in performances, which points to the potential of adapting a subset of the model's parameters in the transductive few-shot setting. We report comprehensive evaluations, which show that LIMO outperforms the very recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. Our code is publicly available at:\[ \href{https://github.com/ghassenbaklouti/LIMO}{\text{here}} \]
中文: 本文提出LIMO方法,通过信息论框架和参数高效微调技术,在视觉语言模型的转导式小样本学习中实现了突破性性能提升。
English: This paper introduces LIMO, a novel transductive few-shot learning method for vision-language models that combines information-theoretic principles with parameter-efficient fine-tuning to significantly outperform existing approaches.

Authors:Younggun Kim, Sirnam Swetha, Fazil Kagdi, Mubarak Shah
Title: Safe-LLaVA: A Privacy-Preserving Vision-Language Dataset and Benchmark for Biometric Safety
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks. However, these models often infer and reveal sensitive biometric attributes - such as race, gender, age, body weight, and eye color - even when such information is not explicitly requested. This raises critical concerns, particularly in real-world applications and socially-sensitive domains. Despite increasing awareness, no publicly available dataset or benchmark exists to comprehensively evaluate or mitigate biometric leakage in MLLMs. To address this gap, we introduce PRISM (Privacy-aware Evaluation of Responses in Sensitive Modalities), a new benchmark designed to assess MLLMs on two fronts: (1) refuse biometric-related queries and (2) implicit biometric leakage in general responses while maintaining semantic faithfulness. Further, we conduct a detailed audit of the widely used LLaVA datasets and uncover extensive biometric leakage across pretraining and instruction data. To address this, we present Safe-LLaVA dataset, the first privacy-preserving MLLM training dataset constructed by systematically removing explicit and implicit biometric information from LLaVA dataset. Our evaluations on PRISM reveal biometric leakages across MLLMs for different attributes, highlighting the detailed privacy-violations. We also fine-tune a model on Safe-LLaVA dataset and show that it substantially reduces the biometric leakages. Together, Safe-LLaVA & PRISM set a new standard for privacy-aligned development and evaluation of MLLMs. The Safe-LLaVA dataset & PRISM benchmark are publicly available at https://huggingface.co/datasets/kyh9191/Safe-LLaVA, and the source code is available at https://github.com/Kimyounggun99/Safe-LLaVA.git.
中文: PRISM基准和Safe-LLaVA数据集的推出旨在评估和减少多模态大语言模型中的生物特征信息泄露,通过拒绝生物特征查询和降低隐含数据暴露来解决隐私问题,同时保持回答质量。
English: The PRISM benchmark and Safe-LLaVA dataset are introduced to evaluate and mitigate biometric information leakage in multimodal large language models, addressing privacy concerns by enabling refusal of biometric queries and reducing implicit data exposure while maintaining response quality.

Authors:Faizan Farooq Khan, Vladan Stojnić, Zakaria Laskar, Mohamed Elhoseiny, Giorgos Tolias
Title: Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders
Abstract:
This work explores text-to-image retrieval for queries that specify or describe a semantic category. While vision-and-language models (VLMs) like CLIP offer a straightforward open-vocabulary solution, they map text and images to distant regions in the representation space, limiting retrieval performance. To bridge this modality gap, we propose a two-step approach. First, we transform the text query into a visual query using a generative diffusion model. Then, we estimate image-to-image similarity with a vision model. Additionally, we introduce an aggregation network that combines multiple generated images into a single vector representation and fuses similarity scores across both query modalities. Our approach leverages advancements in vision encoders, VLMs, and text-to-image generation models. Extensive evaluations show that it consistently outperforms retrieval methods relying solely on text queries. Source code is available at: https://github.com/faixan-khan/cletir
中文: 本研究提出了一种新颖的两步文本-图像检索方法,先通过扩散模型将文本查询转换为视觉表示,再融合跨模态相似度分数,显著超越了仅依赖文本的检索方法。
English: This study introduces a novel two-step method for text-to-image retrieval that first converts text queries into visual representations using a diffusion model and then fuses similarity scores across modalities, significantly outperforming text-only approaches.

Authors:Manish Shukla
Title: Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems
Abstract:
Agentic artificial intelligence (AI) -- multi-agent systems that combine large language models with external tools and autonomous planning -- are rapidly transitioning from research laboratories into high-stakes domains. Our earlier "Basic" paper introduced a five-axis framework and proposed preliminary metrics such as goal drift and harm reduction but did not provide an algorithmic instantiation or empirical evidence. This "Advanced" sequel fills that gap. First, we revisit recent benchmarks and industrial deployments to show that technical metrics still dominate evaluations: a systematic review of 84 papers from 2023--2025 found that 83% report capability metrics while only 30% consider human-centred or economic axes [2]. Second, we formalise an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm that normalises heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds and performs joint anomaly detection via the Mahalanobis distance [7]. Third, we conduct simulations and real-world experiments. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9% compared with static thresholds. We present a comparison table and ROC/PR curves, and we reanalyse case studies to surface missing metrics. Code, data and a reproducibility checklist accompany this paper to facilitate replication. The code supporting this work is available at https://github.com/Manishms18/Adaptive-Multi-Dimensional-Monitoring.
Chinese: 本进阶研究提出自适应多维度监测(AMDM)算法,通过规范化指标和动态阈值显著提升了智能体人工智能系统的异常检测性能,填补了先前研究的空白并提供了实证支持。
English: This advanced paper introduces an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm that significantly improves anomaly detection speed and accuracy in agentic AI systems, addressing gaps from prior research through formalization, simulations, and real-world validation.

Authors:Terry Jingchen Zhang, Gopal Dev, Ning Wang, Nicole Ni, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin
Title: Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
Abstract:
Capability evaluation of large language models (LLMs) is increasingly shadowed by rising concerns of data contamination that cast doubts on whether static benchmarks measure genuine reasoning or mere memorization. We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers, harnessing the natural temporal structure of research publications where performance decay after knowledge cutoffs may indicate potential contamination. We evaluated 4 frontier model represented by 2 models of different knowledge cutoff dates per family on 1,643 multi-step reasoning questions synthesized from 20,277 arXiv papers stratified over 26 months, covering at least 6 months before and after all cutoff dates. Our results consistently showed a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We further performed a comparative analysis with previous longitudinal studies that reported significant post-cutoff performance decay using directly retrieved questions based on public data. we hypothesize that the multi-step reasoning required by our synthesis pipeline offered additional complexity that goes deeper than shallow memorization, which effectively serves a mitigation strategy against benchmark contamination. We fully open source our code and dataset to aid reproducibility and advocate for a paradigm shift that prioritize reasoning-driven synthesis to construct benchmarks over simply collecting newly released questions periodically.
中文: 本研究提出一种可扩展框架,通过arXiv论文生成研究级问题,发现大型语言模型在知识截止日期附近未出现显著性能衰退,表明多步推理能超越单纯记忆,有效缓解基准测试污染。
English: This study introduces a scalable framework to generate research-level questions from arXiv papers, finding no significant performance decay in large language models near their knowledge cutoff dates, which suggests that multi-step reasoning mitigates benchmark contamination by transcending mere memorization.

Authors:Lifan Yuan, Weize Chen, Yuchen Zhang, Ganqu Cui, Hanbin Wang, Ziming You, Ning Ding, Zhiyuan Liu, Maosong Sun, Hao Peng
Title: From $f(x)$ and $g(x)$ to $f(g(x))$: LLMs Learn New Skills in RL by Composing Old Ones
Abstract:
Does RL teach LLMs genuinely new skills, or does it merely activate existing ones? This question lies at the core of ongoing debates about the role of RL in LLM post-training. On one side, strong empirical results can be achieved with RL even without preceding supervised finetuning; on the other, critics argue that RL contributes little beyond reweighting existing reasoning strategies. This work provides concrete evidence that LLMs can acquire genuinely new skills during RL by composing existing ones, mirroring one of the central mechanisms by which humans acquire new cognitive skills. To mitigate data contamination and other confounding factors, and to allow precise control over task complexity, we develop a synthetic framework for our investigation. Specifically, we define a skill as the ability to infer the output of a string transformation function f(x) given x. When an LLM has already learned f and g prior to RL, our experiments reveal that RL enables it to learn unseen compositions of them h(x)=g(f(x)). Further, this compositional ability generalizes to more difficult problems such as compositions of >2 functions unseen during RL training. Surprisingly, our experiments show that compositional skill acquired on a source task transfers to a different target task. This transfer happens even without compositional training on the target, requiring only prior knowledge of the target's atomic skills. Our qualitative analysis shows that RL fundamentally changes the reasoning behaviors of the models. In contrast, next-token training with the same data yields none of these findings. Our systematic experiments provide fresh insights into LLM learning, suggesting the value of first building base models with basic skills, then using RL to incentivize advanced, generalizable skills for complex problems.
中文: 本研究证明强化学习能使大语言模型通过组合现有技能获得真正的新能力,从根本上改变其推理行为,并实现跨任务的技能迁移而无需额外训练。
English: This research demonstrates that reinforcement learning enables large language models to acquire genuinely new compositional skills by combining existing ones, fundamentally altering their reasoning behaviors and enabling skill transfer across tasks without additional training.

Authors:Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu, Maosong Sun
Title: StateX: Enhancing RNN Recall via Post-training State Expansion
Abstract:
While Transformer-based models have demonstrated remarkable language modeling performance, their high complexities result in high costs when processing long contexts. In contrast, recurrent neural networks (RNNs) such as linear attention and state space models have gained popularity due to their constant per-token complexities. However, these recurrent models struggle with tasks that require accurate recall of contextual information from long contexts, because all contextual information is compressed into a constant-size recurrent state. Previous works have shown that recall ability is positively correlated with the recurrent state size, yet directly training RNNs with larger recurrent states results in high training costs. In this paper, we introduce StateX, a training pipeline for efficiently expanding the states of pre-trained RNNs through post-training. For two popular classes of RNNs, linear attention and state space models, we design post-training architectural modifications to scale up the state size with no or negligible increase in model parameters. Experiments on models up to 1.3B parameters demonstrate that StateX efficiently enhances the recall and in-context learning ability of RNNs without incurring high post-training costs or compromising other capabilities.
Chinese: StateX是一种后训练流程,能有效扩展预训练循环神经网络(如线性注意力和状态空间模型)的循环状态大小,无需显著增加参数或成本即可提升其回忆和上下文学习能力。
English: StateX is a post-training pipeline that efficiently expands the recurrent state size of pre-trained RNNs like linear attention and state space models, enhancing their recall and in-context learning abilities without significantly increasing parameters or costs.

Authors:Brian S. Lin, Jiaxin Yuan, Zihan Zhou, Shouli Wang, Shuo Wang, Cunliang Kong, Qi Shi, Yuxuan Li, Liner Yang, Zhiyuan Liu, Maosong Sun
Title: On LLM-Based Scientific Inductive Reasoning Beyond Equations
Abstract:
As large language models (LLMs) increasingly exhibit human-like capabilities, a fundamental question emerges: How can we enable LLMs to learn the underlying patterns from limited examples in entirely novel environments and apply them effectively? This question is central to the ability of LLMs in inductive reasoning. Existing research on LLM-based inductive reasoning can be broadly categorized based on whether the underlying rules are expressible via explicit mathematical equations. However, many recent studies in the beyond-equations category have emphasized rule design without grounding them in specific scenarios. Inspired by the parallels between inductive reasoning and human scientific discovery, we propose the task of LLM-Based Scientific Inductive Reasoning Beyond Equations and introduce a new benchmark, SIRBench-V1, to evaluate the inductive reasoning abilities of LLMs in scientific settings. Our experimental results show that current LLMs still struggle with this task, underscoring its difficulty and the need for further advancement in this area.
中文: 大语言模型在全新环境中从有限样本学习规律面临挑战,为此我们提出了SIRBench-V1基准来评估其在非方程场景下的科学归纳推理能力,实验表明现有模型对此仍力不从心。
English: Large language models face challenges in learning patterns from limited examples in novel environments, prompting the introduction of a new benchmark, SIRBench-V1, to evaluate their scientific inductive reasoning beyond equations, where current models still struggle.

Authors:Yuqin Cao, Xiongkuo Min, Yixuan Gao, Wei Sun, Zicheng Zhang, Jinliang Han, Guangtao Zhai
Title: XGC-AVis: Towards Audio-Visual Content Understanding with a Multi-Agent Collaborative System
Abstract:
In this paper, we propose XGC-AVis, a multi-agent framework that enhances the audio-video temporal alignment capabilities of multimodal large models (MLLMs) and improves the efficiency of retrieving key video segments through 4 stages: perception, planning, execution, and reflection. We further introduce XGC-AVQuiz, the first benchmark aimed at comprehensively assessing MLLMs' understanding capabilities in both real-world and AI-generated scenarios. XGC-AVQuiz consists of 2,685 question-answer pairs across 20 tasks, with two key innovations: 1) AIGC Scenario Expansion: The benchmark includes 2,232 videos, comprising 1,102 professionally generated content (PGC), 753 user-generated content (UGC), and 377 AI-generated content (AIGC). These videos cover 10 major domains and 53 fine-grained categories. 2) Quality Perception Dimension: Beyond conventional tasks such as recognition, localization, and reasoning, we introduce a novel quality perception dimension. This requires MLLMs to integrate low-level sensory capabilities with high-level semantic understanding to assess audio-visual quality, synchronization, and coherence. Experimental results on XGC-AVQuiz demonstrate that current MLLMs struggle with quality perception and temporal alignment tasks. XGC-AVis improves these capabilities without requiring additional training, as validated on two benchmarks.
中文: 本文提出XGC-AVis多智能体框架,通过感知、规划、执行和反思四阶段提升多模态大模型的音视频时序对齐能力与关键片段检索效率,并创建首个包含2,685个问答对的XGC-AVQuiz基准,涵盖PGC/UGC/AIGC视频和创新的质量感知维度,实验表明当前模型在质量感知任务上存在不足而XGC-AVis能有效提升相关能力。
English: This paper introduces XGC-AVis, a multi-agent framework that enhances audio-video temporal alignment and retrieval efficiency in multimodal large models through four stages, along with XGC-AVQuiz, the first benchmark featuring 2,685 QA pairs across diverse video types to evaluate model capabilities including novel quality perception tasks.

Authors:Zhichao Ma, Fan Huang, Lu Zhao, Fengjun Guo, Guangtao Zhai, Xiongkuo Min
Title: DocIQ: A Benchmark Dataset and Feature Fusion Network for Document Image Quality Assessment
Abstract:
Document image quality assessment (DIQA) is an important component for various applications, including optical character recognition (OCR), document restoration, and the evaluation of document image processing systems. In this paper, we introduce a subjective DIQA dataset DIQA-5000. The DIQA-5000 dataset comprises 5,000 document images, generated by applying multiple document enhancement techniques to 500 real-world images with diverse distortions. Each enhanced image was rated by 15 subjects across three rating dimensions: overall quality, sharpness, and color fidelity. Furthermore, we propose a specialized no-reference DIQA model that exploits document layout features to maintain quality perception at reduced resolutions to lower computational cost. Recognizing that image quality is influenced by both low-level and high-level visual features, we designed a feature fusion module to extract and integrate multi-level features from document images. To generate multi-dimensional scores, our model employs independent quality heads for each dimension to predict score distributions, allowing it to learn distinct aspects of document image quality. Experimental results demonstrate that our method outperforms current state-of-the-art general-purpose IQA models on both DIQA-5000 and an additional document image dataset focused on OCR accuracy.
中文: 本文提出了包含5000张增强文档图像及主观评分的DIQA-5000数据集,并设计了一种利用文档布局特征和多层级特征融合的专业无参考质量评估模型,其性能优于现有方法。
English: This paper introduces the DIQA-5000 dataset containing 5,000 enhanced document images with subjective ratings and proposes a specialized no-reference document image quality assessment model that outperforms existing methods by leveraging layout features and multi-level feature fusion.

Authors:Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Liao, Song Jiang
Title: VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results
Abstract:
This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.
中文: VQualA 2025挑战赛通过构建包含数千个视觉质量比较任务的新基准,评估并提升了大型多模态模型在开放域质量推理方面的能力,其中五个模型展现出新兴的评估潜力,推动了可解释质量评估系统的发展。
English: The VQualA 2025 Challenge at ICCV 2025 introduces a novel benchmark to evaluate and advance large multimodal models' capabilities in open-ended visual quality reasoning through thousands of comparison tasks, with five models demonstrating emerging proficiency in holistic quality assessment.

Authors:Dasong Li, Sizhuo Ma, Hang Hua, Wenjie Li, Jian Wang, Chris Wei Zhou, Fengbin Guan, Xin Li, Zihao Yu, Yiting Lu, Ru-Ling Liao, Yan Ye, Zhibo Chen, Wei Sun, Linhan Cao, Yuqin Cao, Weixia Zhang, Wen Wen, Kaiwei Zhang, Zijian Chen, Fangfang Lu, Xiongkuo Min, Guangtao Zhai, Erjia Xiao, Lingfeng Zhang, Zhenjie Su, Hao Cheng, Yu Liu, Renjing Xu, Long Chen, Xiaoshuai Hao, Zhenpeng Zeng, Jianqin Wu, Xuxu Wang, Qian Yu, Bo Hu, Weiwei Wang, Pinxin Liu, Yunlong Tang, Luchuan Song, Jinxi He, Jiaru Wu, Hanjia Lyu
Title: VQualA 2025 Challenge on Engagement Prediction for Short Videos: Methods and Results
Abstract:
This paper presents an overview of the VQualA 2025 Challenge on Engagement Prediction for Short Videos, held in conjunction with ICCV 2025. The challenge focuses on understanding and modeling the popularity of user-generated content (UGC) short videos on social media platforms. To support this goal, the challenge uses a new short-form UGC dataset featuring engagement metrics derived from real-world user interactions. This objective of the Challenge is to promote robust modeling strategies that capture the complex factors influencing user engagement. Participants explored a variety of multi-modal features, including visual content, audio, and metadata provided by creators. The challenge attracted 97 participants and received 15 valid test submissions, contributing significantly to progress in short-form UGC video engagement prediction.
中文: VQualA 2025挑战赛基于新型多模态数据集推进短视频互动预测研究,通过97名参赛者对真实用户互动数据的建模分析,显著提升了用户生成内容的参与度预测能力。
English: The VQualA 2025 Challenge at ICCV 2025 advances engagement prediction for short UGC videos using a new multi-modal dataset, attracting 97 participants to develop robust models based on real user interaction metrics.

Authors:Qixin Zhang, Yan Sun, Can Jin, Xikun Zhang, Yao Shu, Puning Zhao, Li Shen, Dacheng Tao
Title: Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives
Abstract:
In this paper, we present two effective policy learning algorithms for multi-agent online coordination(MA-OC) problem. The first one, \texttt{MA-SPL}, not only can achieve the optimal $(1-\frac{c}{e})$-approximation guarantee for the MA-OC problem with submodular objectives but also can handle the unexplored $α$-weakly DR-submodular and $(γ,β)$-weakly submodular scenarios, where $c$ is the curvature of the investigated submodular functions, $α$ denotes the diminishing-return(DR) ratio and the tuple $(γ,β)$ represents the submodularity ratios. Subsequently, in order to reduce the reliance on the unknown parameters $α,γ,β$ inherent in the \texttt{MA-SPL} algorithm, we further introduce the second online algorithm named \texttt{MA-MPL}. This \texttt{MA-MPL} algorithm is entirely \emph{parameter-free} and simultaneously can maintain the same approximation ratio as the first \texttt{MA-SPL} algorithm. The core of our \texttt{MA-SPL} and \texttt{MA-MPL} algorithms is a novel continuous-relaxation technique termed as \emph{policy-based continuous extension}. Compared with the well-established \emph{multi-linear extension}, a notable advantage of this new \emph{policy-based continuous extension} is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle the challenging weakly submodular objectives. Finally, extensive simulations are conducted to validate the effectiveness of our proposed algorithms.
本文提出了两种多智能体在线协调策略学习算法:MA-SPL算法能在多种次模目标函数下实现最优近似保证,而无需参数的MA-MPL算法在保持相同近似比的同时完全摆脱了对未知参数的依赖。
This paper introduces two policy learning algorithms for multi-agent online coordination: MA-SPL with optimal approximation guarantees for various submodular objectives, and parameter-free MA-MPL maintaining comparable performance while eliminating dependency on unknown parameters.

Authors:Botai Yuan, Yutian Zhou, Yingjie Wang, Fushuo Huo, Yongcheng Jing, Li Shen, Ying Wei, Zhiqi Shen, Ziwei Liu, Tianwei Zhang, Jie Yang, Dacheng Tao
Title: EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models
Abstract:
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information -- in high-stakes clinical settings. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs. It contains 2,122 images across 18 departments and 20 modalities with 90 prompts that simulate biased inputs from patients, medical students, and physicians. We evaluate medical-specific, open-source, and proprietary LVLMs. All exhibit substantial sycophancy; the best proprietary model (Claude 3.7 Sonnet) still shows 45.98% sycophancy, and GPT-4.1 reaches 59.15%. Many medical-specific models exceed 95% sycophancy despite only moderate accuracy. Fine-grained analyses by bias type, department, perceptual granularity, and modality identify factors that increase susceptibility. We further show that higher data quality/diversity and stronger domain knowledge reduce sycophancy without harming unbiased accuracy. EchoBench also serves as a testbed for mitigation: simple prompt-level interventions (negative prompting, one-shot, few-shot) produce consistent reductions and motivate training- and decoding-time strategies. Our findings highlight the need for robust evaluation beyond accuracy and provide actionable guidance toward safer, more trustworthy medical LVLMs.
中文摘要:现有医学大视觉语言模型基准过度关注准确率而忽视可靠性与安全性,为此我们开发EchoBench评估模型在临床环境中盲从用户偏见的“迎合行为”,发现所有测试模型均存在严重问题,并提出了降低风险的有效干预措施。
English Summary: Current medical LVLM benchmarks prioritize accuracy but neglect reliability and safety, prompting the creation of EchoBench to measure sycophancy—where models uncritically echo biased inputs—revealing alarming rates across all tested models and offering mitigation strategies for safer deployment.

Authors:Yuchen Wu, Liang Ding, Li Shen, Dacheng Tao
Title: Robust Knowledge Editing via Explicit Reasoning Chains for Distractor-Resilient Multi-Hop QA
Abstract:
Large language models (LLMs) encode vast amounts of world knowledge but remain static once trained, making the timely integration of emerging facts prohibitively expensive via full retraining. Knowledge-editing techniques have thus emerged to inject or overwrite specific facts into LLMs, yet they either over-rely on superficial cues or incur complex, iterative pipelines that collapse under noisy, multi-hop conditions. We introduce Reason-KE, an end-to-end reasoning-chain-based editing framework that steers a pretrained LLM through four structured stages-fact acknowledgment, relevance determination, selective application, and final reasoning-to filter distractors in a single pass. Trained on MQuAKE-CF with up to four irrelevant facts, Reason-KE elevates Qwen2.5-7B's multi-hop QA accuracy to 90.2% while suffering merely a 6.3% drop under heavy distraction and <1% when answers are leaked. Our quantitative analysis confirms Reason-KE's resilience and efficiency, establishing a new state-of-the-art for reliable LLM knowledge updates.
中文摘要:Reason-KE提出了一种端到端的推理链编辑框架,通过结构化推理阶段增强大语言模型整合新知识的能力,在实现90.2%多跳问答准确率的同时保持了对干扰信息的强鲁棒性。
English Summary: Reason-KE introduces an end-to-end reasoning framework that enhances large language models' ability to integrate new facts through structured reasoning stages, achieving 90.2% multi-hop QA accuracy while maintaining resilience against distractions.

Authors:Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Title: Scaling Generalist Data-Analytic Agents
Abstract:
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to face diverse-format, large-scale data files and long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents, including insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate some empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide actionable insights about agentic training for the community. We will release DataMind-12K and DataMind-7B,14B for the community's future research.
中文: DataMind提出了一种可扩展的数据合成与智能体训练方法,旨在构建通用数据分析智能体,通过解决数据不足和不稳定多轮推理等关键难题,其7B和14B模型在多项基准测试中实现了最优性能。
English: DataMind introduces a scalable data synthesis and agent training method to develop generalist data-analytic agents, overcoming challenges like insufficient data and unstable multi-turn reasoning, achieving state-of-the-art performance on benchmarks with its 7B and 14B models.

Authors:Yihong Liu, Junyi Li, Wayne Xin Zhao, Hongyu Lu, Ji-Rong Wen
Title: Experience-Guided Reflective Co-Evolution of Prompts and Heuristics for Automatic Algorithm Design
Abstract:
Combinatorial optimization problems are traditionally tackled with handcrafted heuristic algorithms, which demand extensive domain expertise and significant implementation effort. Recent progress has highlighted the potential of automatic heuristics design powered by large language models (LLMs), enabling the automatic generation and refinement of heuristics. These approaches typically maintain a population of heuristics and employ LLMs as mutation operators to evolve them across generations. While effective, such methods often risk stagnating in local optima. To address this issue, we propose the Experience-Guided Reflective Co-Evolution of Prompt and Heuristics (EvoPH) for automatic algorithm design, a novel framework that integrates the island migration model with the elites selection algorithm to simulate diverse heuristics populations. In EvoPH, prompts are co-evolved with heuristic algorithms, guided by performance feedback. We evaluate our framework on two problems, i.e., Traveling Salesman Problem and Bin Packing Problem. Experimental results demonstrate that EvoPH achieves the lowest relative error against optimal solutions across both datasets, advancing the field of automatic algorithm design with LLMs.
Chinese: EvoPH框架通过岛屿迁移模型与精英选择算法协同进化提示与启发式方法,有效避免局部最优,在旅行商和装箱问题上实现了最低误差率。
English: The EvoPH framework co-evolves prompts and heuristics using an island migration model and elite selection to overcome local optima, achieving the lowest error rates on the Traveling Salesman and Bin Packing problems.

Authors:Xinping Lei, Tong Zhou, Yubo Chen, Kang Liu, Jun Zhao
Title: MotivGraph-SoIQ: Integrating Motivational Knowledge Graphs and Socratic Dialogue for Enhanced LLM Ideation
Abstract:
Large Language Models (LLMs) hold substantial potential for accelerating academic ideation but face critical challenges in grounding ideas and mitigating confirmation bias for further refinement. We propose integrating motivational knowledge graphs and socratic dialogue to address these limitations in enhanced LLM ideation (MotivGraph-SoIQ). This novel framework provides essential grounding and practical idea improvement steps for LLM ideation by integrating a Motivational Knowledge Graph (MotivGraph) with a Q-Driven Socratic Ideator. The MotivGraph structurally stores three key node types(problem, challenge and solution) to offer motivation grounding for the LLM ideation process. The Ideator is a dual-agent system utilizing Socratic questioning, which facilitates a rigorous refinement process that mitigates confirmation bias and improves idea quality across novelty, experimental rigor, and motivational rationality dimensions. On the ICLR25 paper topics dataset, MotivGraph-SoIQ exhibits clear advantages over existing state-of-the-art approaches across LLM-based scoring, ELO ranking, and human evaluation metrics.
中文摘要:提出的MotivGraph-SoIQ框架通过结合动机知识图谱和苏格拉底式提问,为LLM构思提供基础并减少确认偏差,在各项评估中展现出卓越性能。
English Summary: The proposed MotivGraph-SoIQ framework enhances LLM ideation by integrating motivational knowledge graphs and Socratic questioning to ground ideas and reduce confirmation bias, demonstrating superior performance in evaluations.

Authors:Wei Wei, Zheng Lin, Xihui Liu, Hongyang Du, Dusit Niyato, Xianhao Chen
Title: Optimizing Split Federated Learning with Unstable Client Participation
Abstract:
To enable training of large artificial intelligence (AI) models at the network edge, split federated learning (SFL) has emerged as a promising approach by distributing computation between edge devices and a server. However, while unstable network environments pose significant challenges to SFL, prior schemes often overlook such an effect by assuming perfect client participation, rendering them impractical for real-world scenarios. In this work, we develop an optimization framework for SFL with unstable client participation. We theoretically derive the first convergence upper bound for SFL with unstable client participation by considering activation uploading failures, gradient downloading failures, and model aggregation failures. Based on the theoretical results, we formulate a joint optimization problem for client sampling and model splitting to minimize the upper bound. We then develop an efficient solution approach to solve the problem optimally. Extensive simulations on EMNIST and CIFAR-10 demonstrate the superiority of our proposed framework compared to existing benchmarks.
中文: 本研究针对客户端参与不稳定的分割联邦学习,开发了一个优化框架,通过联合优化客户端采样和模型分割来最小化收敛上界,并在模拟中展现出优于现有方案的性能。
English: This study develops an optimization framework for split federated learning that addresses unstable client participation by jointly optimizing client sampling and model splitting to minimize convergence bounds, demonstrating superior performance in simulations.

Authors:Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan
Title: Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels
Abstract:
Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model's knowledge remains underexplored, limiting our ability to control knowledge change behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families. Surprisingly, models fine-tuned on 1,920 samples perform up to 14% worse than those fine-tuned on only 240 samples. Furthermore, varying the level of knowledge mastery in the fine-tuning data leads to performance fluctuations of over 12%. To investigate these effects, we analyze model behavior at both the token and parameter levels. Our analysis reveals that up to 90% of parameter updates during SFT do not contribute to knowledge enhancement. Restoring these updates can improve performance on the CBQA task, depending on the characteristics of the fine-tuning data. These insights offer practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.
中文: 监督微调可能意外削弱大语言模型的知识,闭卷问答性能下降高达14%,大部分参数更新无助于知识增强,但根据微调数据特性恢复这些更新可提升效果。
English: Supervised fine-tuning (SFT) can unexpectedly degrade large language models' knowledge, as shown by up to 14% performance drop in closed-book question answering, with most parameter updates failing to enhance knowledge, but restoring them can improve results depending on the fine-tuning data.

Authors:Dinura Dissanayake, Ahmed Heakl, Omkar Thawakar, Noor Ahsan, Ritesh Thawkar, Ketan More, Jean Lahoud, Rao Anwer, Hisham Cholakkal, Ivan Laptev, Fahad Shahbaz Khan, Salman Khan
Title: How Good are Foundation Models in Step-by-Step Embodied Reasoning?
Abstract:
Embodied agents operating in the physical world must make decisions that are not only effective but also safe, spatially coherent, and grounded in context. While recent advances in large multimodal models (LMMs) have shown promising capabilities in visual understanding and language generation, their ability to perform structured reasoning for real-world embodied tasks remains underexplored. In this work, we aim to understand how well foundation models can perform step-by-step reasoning in embodied environments. To this end, we propose the Foundation Model Embodied Reasoning (FoMER) benchmark, designed to evaluate the reasoning capabilities of LMMs in complex embodied decision-making scenarios. Our benchmark spans a diverse set of tasks that require agents to interpret multimodal observations, reason about physical constraints and safety, and generate valid next actions in natural language. We present (i) a large-scale, curated suite of embodied reasoning tasks, (ii) a novel evaluation framework that disentangles perceptual grounding from action reasoning, and (iii) empirical analysis of several leading LMMs under this setting. Our benchmark includes over 1.1k samples with detailed step-by-step reasoning across 10 tasks and 8 embodiments, covering three different robot types. Our results highlight both the potential and current limitations of LMMs in embodied reasoning, pointing towards key challenges and opportunities for future research in robot intelligence. Our data and code will be made publicly available.
Chinese: 本研究提出FoMER基准,用于评估大型多模态模型在具身环境中的逐步推理能力,揭示了其在多模态理解和安全行动生成任务中的潜力与局限。
English: This study introduces the Foundation Model Embodied Reasoning (FoMER) benchmark to assess large multimodal models' step-by-step reasoning abilities in embodied environments, revealing their potential and limitations in tasks requiring multimodal interpretation and safe action generation.

Authors:Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang
Title: Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Abstract:
While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by ``overthinking'', a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50\% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.
Chinese: DECS框架通过解耦的令牌级奖励机制和创新课程批量调度策略,解决了大型推理模型过度思考的问题,在多个基准测试中实现了推理令牌数量减少50%以上,同时保持甚至提升了模型性能。
English: The DECS framework addresses the issue of overthinking in large reasoning models by introducing a decoupled token-level reward mechanism and curriculum batch scheduling, achieving over 50% reduction in reasoning tokens without compromising performance across multiple benchmarks.

Authors:Junying Wang, Zicheng Zhang, Ye Shen, Yalun Wu, Yingji Liang, Yijin Guo, Farong Wen, Wenzhe Li, Xuezhi Zhao, Qi Jia, Guangtao Zhai
Title: Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs
Abstract:
High-quality, multi-modal benchmarks are crucial for advancing scientific reasoning in large models yet their manual creation is costly and unscalable. To address this bottleneck, we explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs), which include three parts: 1) Task Definition \& Evaluation Rubric: We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality rubric that provides principles for the transformation. 2) Benchmark Construction: Then we construct two extensive benchmarks to rigorously evaluate state-of-the-art generation \& understanding models on the distinct tasks of MMQA generation \& MMQA quality evaluation. 3) Preliminary Solution: We develop an agentic system (Q-Mirror), which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement. Our experiments show that while state-of-the-art models can generate MMQAs, their outputs still leave substantial gaps, underscoring the need for reliable evaluation. We further demonstrate that top-tier understanding models align closely with human judgment in MMQA quality assessment. Leveraging both insights, the Q-Mirror agent raises average scores from 78.90 to 85.22 and pass rates from 72\% to 95\%, offering a practical path to large-scale scientific benchmarks.
中文: 本研究提出将纯文本问答对转化为多模态问答对的框架,通过构建基准和智能代理系统,在迭代优化中显著提升了生成质量与评估准确性。
English: This study introduces a framework to convert text-only QA pairs into multi-modal QA pairs, creating benchmarks and an agentic system that significantly improves generation quality and evaluation accuracy through iterative refinement.

Authors:Yijin Guo, Zicheng Zhang, Ye Shen, Farong Wen, Junying Wang, Qi Jia, Guangtao Zhai
Title: QoNext: Towards Next-generation QoE for Foundation Models
Abstract:
Existing evaluations of foundation models, including recent human-centric approaches, fail to capture what truly matters: user's experience during interaction. Current methods treat evaluation as a matter of output correctness alone, overlooking that user satisfaction emerges from the interplay between response quality and interaction, which limits their ability to account for the mechanisms underlying user experience. To address this gap, we introduce QoNext, the first framework that adapts Quality of Experience (QoE) principles from networking and multimedia to the assessment of foundation models. QoNext identifies experiential factors that shape user experience and incorporates them into controlled experiments, where human ratings are collected under varied configurations. From these studies we construct a QoE-oriented database and train predictive models that estimate perceived user experience from measurable system parameters. Our results demonstrate that QoNext not only enables proactive and fine-grained evaluation but also provides actionable guidance for productized services of optimizing foundation models in practice.
中文: 现有基础模型评估仅关注输出正确性而忽视整体用户体验,为此我们提出QoNext体验质量框架,通过识别关键体验因素并训练预测模型,实现主动优化用户满意度的精细评估。
English: Current foundation model evaluations overlook the holistic user experience by focusing solely on output correctness, prompting the introduction of QoNext, a Quality of Experience framework that identifies key experiential factors and trains predictive models to proactively optimize user satisfaction.

Authors:Ye Shen, Junying Wang, Farong Wen, Yijin Guo, Qi Jia, Zicheng Zhang, Guangtao Zhai
Title: A Multi-To-One Interview Paradigm for Efficient MLLM Evaluation
Abstract:
The rapid progress of Multi-Modal Large Language Models (MLLMs) has spurred the creation of numerous benchmarks. However, conventional full-coverage Question-Answering evaluations suffer from high redundancy and low efficiency. Inspired by human interview processes, we propose a multi-to-one interview paradigm for efficient MLLM evaluation. Our framework consists of (i) a two-stage interview strategy with pre-interview and formal interview phases, (ii) dynamic adjustment of interviewer weights to ensure fairness, and (iii) an adaptive mechanism for question difficulty-level chosen. Experiments on different benchmarks show that the proposed paradigm achieves significantly higher correlation with full-coverage results than random sampling, with improvements of up to 17.6% in PLCC and 16.7% in SRCC, while reducing the number of required questions. These findings demonstrate that the proposed paradigm provides a reliable and efficient alternative for large-scale MLLM benchmarking.
Chinese: 该研究提出了一种多对一的面试范式,用于评估多模态大语言模型,在减少所需问题数量的同时,显著提高了与全覆盖结果的相关性和评估效率。
English: The study introduces a multi-to-one interview paradigm for evaluating Multi-Modal Large Language Models, which enhances efficiency and correlation with full-coverage results while reducing the number of questions needed.

Authors:Boyang Liu, Yifan Hu, Senjie Jin, Shihan Dou, Gonglei Shi, Jie Shao, Tao Gui, Xuanjing Huang
Title: Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization
Abstract:
Multimodal large language models (MLLMs) are well suited to image aesthetic assessment, as they can capture high-level aesthetic features leveraging their cross-modal understanding capacity. However, the scarcity of multimodal aesthetic reasoning data and the inherently subjective nature of aesthetic judgment make it difficult for MLLMs to generate accurate aesthetic judgments with interpretable rationales. To this end, we propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Concretely, Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data used for cold-start. After teaching the model to generate structured explanations prior to scoring, we then employ the Relative-Absolute Policy Optimization (RAPO), a novel RL algorithm that jointly optimizes absolute score regression and relative ranking order, improving both per-image accuracy and cross-image preference judgments. Aes-R1 enables MLLMs to generate grounded explanations alongside faithful scores, thereby enhancing aesthetic scoring and reasoning in a unified framework. Extensive experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%, surpassing state-of-the-art baselines of similar size. More ablation studies validate Aes-R1's robust generalization under limited supervision and in out-of-distribution scenarios.
Chinese: Aes-R1框架通过创新的强化学习算法,在提升绝对评分和相对排序能力的同时,使多模态大语言模型能够生成可解释的美学判断依据与精确评分,显著增强美学评估性能。
English: The Aes-R1 framework enhances multimodal large language models' aesthetic assessment by generating interpretable rationales and accurate scores through a novel reinforcement learning approach that improves both absolute and relative judgments.

Authors:Takahiro Hattori, Kento Kawaharazuka, Kei Okada
Title: Design and Development of a Remotely Wire-Driven Walking Robot
Abstract:
Operating in environments too harsh or inaccessible for humans is one of the critical roles expected of robots. However, such environments often pose risks to electronic components as well. To overcome this, various approaches have been developed, including autonomous mobile robots without electronics, hydraulic remotely actuated mobile robots, and long-reach robot arms driven by wires. Among these, electronics-free autonomous robots cannot make complex decisions, while hydraulically actuated mobile robots and wire-driven robot arms are used in harsh environments such as nuclear power plants. Mobile robots offer greater reach and obstacle avoidance than robot arms, and wire mechanisms offer broader environmental applicability than hydraulics. However, wire-driven systems have not been used for remote actuation of mobile robots. In this study, we propose a novel mechanism called Remote Wire Drive that enables remote actuation of mobile robots via wires. This mechanism is a series connection of decoupled joints, a mechanism used in wire-driven robot arms, adapted for power transmission. We experimentally validated its feasibility by actuating a wire-driven quadruped robot, which we also developed in this study, through Remote Wire Drive.
中文摘要:本研究提出了一种新型远程线驱动机制,通过线缆实现移动机器人的远程驱动,克服了恶劣环境下电子元件的局限性,并扩展了其应用范围。
English Summary: This study introduces a novel Remote Wire Drive mechanism that enables the remote actuation of mobile robots via wires, overcoming the limitations of electronics in harsh environments and expanding their operational capabilities.

Authors:Temma Suzuki, Kento Kawaharazuka, Kei Okada
Title: A Universal Wire Testing Machine for Enhancing the Performance of Wire-Driven Robots
Abstract:
Compared with gears and linkages, wires constitute a lightweight, low-friction transmission mechanism. However, because wires are flexible materials, they tend to introduce large modeling errors, and their adoption in industrial and research robots remains limited.In this study, we built a Universal Wire Testing Machine that enables measurement and adjustment of wire characteristics to improve the performance of wire-driven mechanisms. Using this testing machine, we carried out removal of initial wire stretch, measurement of tension transmission efficiency for eight different diameters of passive pulleys, and measurement of the dynamic behavior of variable-length wires. Finally, we applied the data obtained from this testing machine to the force control of an actual wire-driven robot, reducing the end-effector force error.
中文: 本研究构建了一台通用线缆测试机,用于测量和调整线缆特性,通过减少机器人末端执行器的力误差,提升了线缆驱动机构的性能。
English: This study developed a Universal Wire Testing Machine to measure and adjust wire characteristics, improving wire-driven mechanisms' performance by reducing end-effector force errors in robots.

Authors:Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
Title: CM-Align: Consistency-based Multilingual Alignment for Large Language Models
Abstract:
Current large language models (LLMs) generally show a significant performance gap in alignment between English and other languages. To bridge this gap, existing research typically leverages the model's responses in English as a reference to select the best/worst responses in other languages, which are then used for Direct Preference Optimization (DPO) training. However, we argue that there are two limitations in the current methods that result in noisy multilingual preference data and further limited alignment performance: 1) Not all English responses are of high quality, and using a response with low quality may mislead the alignment for other languages. 2) Current methods usually use biased or heuristic approaches to construct multilingual preference pairs. To address these limitations, we design a consistency-based data selection method to construct high-quality multilingual preference data for improving multilingual alignment (CM-Align). Specifically, our method includes two parts: consistency-guided English reference selection and cross-lingual consistency-based multilingual preference data construction. Experimental results on three LLMs and three common tasks demonstrate the effectiveness and superiority of our method, which further indicates the necessity of constructing high-quality preference data.
中文: 当前大语言模型的多语言对齐方法因低质量英文参考和带有偏好的构建方式导致偏好数据噪声大,为此我们提出CM-Align,通过一致性引导选择优质英文基准并构建跨语言偏好对,实验证明该方法能有效提升多语言对齐性能。
English: Current multilingual alignment methods for LLMs suffer from noisy preference data due to low-quality English references and biased construction approaches, prompting the development of CM-Align, a consistency-based method that enhances alignment by selecting superior English benchmarks and constructing reliable cross-lingual preference pairs, validated across multiple models and tasks.

Authors:Ayano Miyamichi, Moju Zhao, Kazuki Sugihara, Junichiro Sugihara, Masanori Konishi, Kunio Kojima, Kei Okada, Masayuki Inaba
Title: Flexible Morphing Aerial Robot with Inflatable Structure for Perching-based Human-Robot Interaction
Abstract:
Birds in nature perform perching not only for rest but also for interaction with human such as the relationship with falconers. Recently, researchers achieve perching-capable aerial robots as a way to save energy, and deformable structure demonstrate significant advantages in efficiency of perching and compactness of configuration. However, ensuring flight stability remains challenging for deformable aerial robots due to the difficulty of controlling flexible arms. Furthermore, perching for human interaction requires high compliance along with safety. Thus, this study aims to develop a deformable aerial robot capable of perching on humans with high flexibility and grasping ability. To overcome the challenges of stability of both flight and perching, we propose a hybrid morphing structure that combines a unilateral flexible arm and a pneumatic inflatable actuators. This design allows the robot's arms to remain rigid during flight and soft while perching for more effective grasping. We also develop a pneumatic control system that optimizes pressure regulation while integrating shock absorption and adjustable grasping forces, enhancing interaction capabilities and energy efficiency. Besides, we focus on the structural characteristics of the unilateral flexible arm and identify sufficient conditions under which standard quadrotor modeling and control remain effective in terms of flight stability. Finally, the developed prototype demonstrates the feasibility of compliant perching maneuvers on humans, as well as the robust recovery even after arm deformation caused by thrust reductions during flight. To the best of our knowledge, this work is the first to achieve an aerial robot capable of perching on humans for interaction.
中文摘要:本研究开发了一种具有混合变形结构的可变形空中机器人,通过单侧柔性臂和气动执行器实现在飞行时保持刚性、栖附时转为柔性的模式切换,从而能够稳定飞行并柔顺地栖附在人体上进行交互。
English Summary: This study develops a deformable aerial robot with a hybrid morphing structure that enables stable flight and compliant perching on humans for interaction, using a unilateral flexible arm and pneumatic actuators to switch between rigid flight and soft grasping modes.

Authors:Jiewei Chen, Xiumei Deng, Zehui Xiong, Shaoyong Guo, Xuesong Qiu, Ping Wang, Dusit Niyato
Title: CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks
Abstract:
The increasing demand for intelligent mobile applications has made multi-agent collaboration with Transformer-based large language models (LLMs) essential in mobile edge computing (MEC) networks. However, training LLMs in such environments remains challenging due to heavy computation, high end-to-end latency, and limited model generalization. We introduce CollaPipe, a hybrid distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving intelligent networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. Then we perform global model update via federated aggregation. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power. We derive and use a closed-form convergence bound to design an Dynamic Segment Scheduling and Resource Allocation (DSSDA) algorithm based on Lyapunov optimization, ensuring system stability under long-term constraints. Extensive experiments on downstream tasks with Transformer and BERT models show that CollaPipe improves computation efficiency by up to 15.09%, reduces end-to-end latency by at least 48.98%, and cuts single device memory usage by more than half, enabling online learning in heterogeneous and dynamic communication environments.
中文: CollaPipe是一种混合分布式学习框架,通过结合协作式流水线并行与联邦聚合,优化移动边缘计算网络中大型语言模型的训练,显著提升计算效率、降低延迟并减少内存使用,实现动态环境中的在线学习。
English: CollaPipe is a hybrid distributed learning framework that combines collaborative pipeline parallelism with federated aggregation to optimize LLM training in MEC networks, significantly improving computation efficiency, reducing latency, and cutting memory usage for online learning in dynamic environments.

Authors:Joy Jia Yin Lim, Daniel Zhang-Li, Jifan Yu, Xin Cong, Ye He, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu
Title: Learning in Context: Personalizing Educational Content with Large Language Models to Enhance Student Learning
Abstract:
Standardized, one-size-fits-all educational content often fails to connect with students' individual backgrounds and interests, leading to disengagement and a perceived lack of relevance. To address this challenge, we introduce PAGE, a novel framework that leverages large language models (LLMs) to automatically personalize educational materials by adapting them to each student's unique context, such as their major and personal interests. To validate our approach, we deployed PAGE in a semester-long intelligent tutoring system and conducted a user study to evaluate its impact in an authentic educational setting. Our findings show that students who received personalized content demonstrated significantly improved learning outcomes and reported higher levels of engagement, perceived relevance, and trust compared to those who used standardized materials. This work demonstrates the practical value of LLM-powered personalization and offers key design implications for creating more effective, engaging, and trustworthy educational experiences.
中文: PAGE框架利用大型语言模型为每位学生量身定制教育内容,相比标准化材料,显著提升了学习效果、参与度和内容相关性感知。
English: The PAGE framework utilizes large language models to tailor educational content to individual students' contexts, significantly enhancing learning outcomes, engagement, and perceived relevance compared to standardized materials.

Authors:Binquan Guo, Zehui Xiong, Zhou Zhang, Baosheng Li, Dusit Niyato, Chau Yuen, Zhu Han
Title: Resilience of Mega-Satellite Constellations: How Node Failures Impact Inter-Satellite Networking Over Time?
Abstract:
Mega-satellite constellations have the potential to leverage inter-satellite links to deliver low-latency end-to-end communication services globally, thereby extending connectivity to underserved regions. However, harsh space environments make satellites vulnerable to failures, leading to node removals that disrupt inter-satellite networking. With the high risk of satellite node failures, understanding their impact on end-to-end services is essential. This study investigates the importance of individual nodes on inter-satellite networking and the resilience of mega satellite constellations against node failures. We represent the mega-satellite constellation as discrete temporal graphs and model node failure events accordingly. To quantify node importance for targeted services over time, we propose a service-aware temporal betweenness metric. Leveraging this metric, we develop an analytical framework to identify critical nodes and assess the impact of node failures. The framework takes node failure events as input and efficiently evaluates their impacts across current and subsequent time windows. Simulations on the Starlink constellation setting reveal that satellite networks inherently exhibit resilience to node failures, as their dynamic topology partially restore connectivity and mitigate the long-term impact. Furthermore, we find that the integration of rerouting mechanisms is crucial for unleashing the full resilience potential to ensure rapid recovery of inter-satellite networking.
中文摘要:巨型卫星星座虽能提供全球低延迟通信,但节点故障易破坏星间网络;本研究提出服务感知的时序中介中心性度量框架,发现动态拓扑与重路由机制能释放星座韧性,实现快速恢复。
English Summary: Mega-satellite constellations can provide global low-latency communication but face vulnerability to node failures, prompting this study to develop a service-aware framework that identifies critical nodes and reveals the networks' inherent resilience through dynamic topology and rerouting mechanisms.

Authors:Fangchen Yu, Junchi Yao, Ziyi Wang, Haiyuan Wan, Youling Huang, Bo Zhang, Shuyue Hu, Dongzhan Zhou, Ning Ding, Ganqu Cui, Lei Bai, Wanli Ouyang, Peng Ye
Title: PhysicsMinions: Winning Gold Medals in the Latest Physics Olympiads with a Coevolutionary Multimodal Multi-Agent System
Abstract:
Physics is central to understanding and shaping the real world, and the ability to solve physics problems is a key indicator of real-world physical intelligence. Physics Olympiads, renowned as the crown of competitive physics, provide a rigorous testbed requiring complex reasoning and deep multimodal understanding, yet they remain largely underexplored in AI research. Existing approaches are predominantly single-model based, and open-source MLLMs rarely reach gold-medal-level performance. To address this gap, we propose PhysicsMinions, a coevolutionary multi-agent system for Physics Olympiad. Its architecture features three synergistic studios: a Visual Studio to interpret diagrams, a Logic Studio to formulate solutions, and a Review Studio to perform dual-stage verification. The system coevolves through an iterative refinement loop where feedback from the Review Studio continuously guides the Logic Studio, enabling the system to self-correct and converge towards the ground truth. Evaluated on the HiPhO benchmark spanning 7 latest physics Olympiads, PhysicsMinions delivers three major breakthroughs: (i) Strong generalization: it consistently improves both open-source and closed-source models of different sizes, delivering clear benefits over their single-model baselines; (ii) Historic breakthroughs: it elevates open-source models from only 1-2 to 6 gold medals across 7 Olympiads, achieving the first-ever open-source gold medal in the latest International Physics Olympiad (IPhO) under the average-score metric; and (iii) Scaling to human expert: it further advances the open-source Pass@32 score to 26.8/30 points on the latest IPhO, ranking 4th of 406 contestants and far surpassing the top single-model score of 22.7 (ranked 22nd). Generally, PhysicsMinions offers a generalizable framework for Olympiad-level problem solving, with the potential to extend across disciplines.
Chinese Summary: PhysicsMinions提出了一种协同进化的多智能体系统,通过视觉解析、逻辑推理和双重验证的协同工作,在物理奥林匹克竞赛中实现突破性表现,不仅将开源模型提升至金牌水平,还在最新国际物理奥赛中超越多数人类选手排名前列。
English Summary: PhysicsMinions introduces a coevolutionary multi-agent system that synergistically interprets diagrams, formulates solutions, and performs dual-stage verification to achieve groundbreaking performance in Physics Olympiads, including elevating open-source models to gold-medal levels and ranking competitively against human contestants.

Authors:Xiaoyu Yue, Zidong Wang, Yuqing Wang, Wenlong Zhang, Xihui Liu, Wanli Ouyang, Lei Bai, Luping Zhou
Title: Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Abstract:
Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.
中文摘要:本研究提出ST-AR自引导训练框架,通过解决自回归视觉模型中局部依赖性、语义不一致和空间不变性三大核心问题,在不依赖预训练模型的情况下显著提升了图像理解能力与生成质量。
English Summary: This study introduces ST-AR, a novel self-guided training framework that addresses three key limitations in autoregressive visual models—local dependence, semantic inconsistency, and spatial invariance—significantly improving image understanding and generation quality without requiring pre-trained models.

Authors:Dong Han, Zhehong Ai, Pengxiang Cai, Shuzhou Sun, Shanya Lu, Jianpeng Chen, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao XU, Yuqiang Li, Shufei Zhang
Title: ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
Abstract:
The efficiency of Bayesian optimization (BO) in chemistry is often hindered by sparse experimental data and complex reaction mechanisms. To overcome these limitations, we introduce ChemBOMAS, a new framework named LLM-Enhanced Multi-Agent System for accelerating BO in chemistry. ChemBOMAS's optimization process is enhanced by LLMs and synergistically employs two strategies: knowledge-driven coarse-grained optimization and data-driven fine-grained optimization. First, in the knowledge-driven coarse-grained optimization stage, LLMs intelligently decompose the vast search space by reasoning over existing chemical knowledge to identify promising candidate regions. Subsequently, in the data-driven fine-grained optimization stage, LLMs enhance the BO process within these candidate regions by generating pseudo-data points, thereby improving data utilization efficiency and accelerating convergence. Benchmark evaluations** further confirm that ChemBOMAS significantly enhances optimization effectiveness and efficiency compared to various BO algorithms. Importantly, the practical utility of ChemBOMAS was validated through wet-lab experiments conducted under pharmaceutical industry protocols, targeting conditional optimization for a previously unreported and challenging chemical reaction. In the wet experiment, ChemBOMAS achieved an optimal objective value of 96%. This was substantially higher than the 15% achieved by domain experts. This real-world success, together with strong performance on benchmark evaluations, highlights ChemBOMAS as a powerful tool to accelerate chemical discovery.
中文:ChemBOMAS是一个LLM增强的多智能体系统,通过结合知识驱动的粗粒度优化和数据驱动的细粒度优化,克服了化学中贝叶斯优化的局限性,显著提高了效率,并在实际制药实验中取得了96%的成功率。
English: ChemBOMAS is an LLM-enhanced multi-agent system that overcomes Bayesian optimization limitations in chemistry by combining knowledge-driven coarse-grained and data-driven fine-grained optimization, significantly improving efficiency and achieving a 96% success rate in real-world pharmaceutical experiments.

Authors:Yingchen Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng
Title: Does Generative Retrieval Overcome the Limitations of Dense Retrieval?
Abstract:
Generative retrieval (GR) has emerged as a new paradigm in neural information retrieval, offering an alternative to dense retrieval (DR) by directly generating identifiers of relevant documents. In this paper, we theoretically and empirically investigate how GR fundamentally diverges from DR in both learning objectives and representational capacity. GR performs globally normalized maximum-likelihood optimization and encodes corpus and relevance information directly in the model parameters, whereas DR adopts locally normalized objectives and represents the corpus with external embeddings before computing similarity via a bilinear interaction. Our analysis suggests that, under scaling, GR can overcome the inherent limitations of DR, yielding two major benefits. First, with larger corpora, GR avoids the sharp performance degradation caused by the optimization drift induced by DR's local normalization. Second, with larger models, GR's representational capacity scales with parameter size, unconstrained by the global low-rank structure that limits DR. We validate these theoretical insights through controlled experiments on the Natural Questions and MS MARCO datasets, across varying negative sampling strategies, embedding dimensions, and model scales. But despite its theoretical advantages, GR does not universally outperform DR in practice. We outline directions to bridge the gap between GR's theoretical potential and practical performance, providing guidance for future research in scalable and robust generative retrieval.
中文摘要:生成式检索通过全局优化和参数化表征理论上克服了稠密检索的扩展瓶颈,但实际性能仍有差距需进一步研究。
English Summary: Generative retrieval theoretically surpasses dense retrieval by overcoming scaling limitations through global optimization and parameter-based representation, though practical performance gaps remain.

Authors:Changjiang Zhou, Ruqing Zhang, Jiafeng Guo, Yu-An Liu, Fan Zhang, Ganyuan Luo, Xueqi Cheng
Title: A Generative Framework for Personalized Sticker Retrieval
Abstract:
Formulating information retrieval as a variant of generative modeling, specifically using autoregressive models to generate relevant identifiers for a given query, has recently attracted considerable attention. However, its application to personalized sticker retrieval remains largely unexplored and presents unique challenges: existing relevance-based generative retrieval methods typically lack personalization, leading to a mismatch between diverse user expectations and the retrieved results. To address this gap, we propose PEARL, a novel generative framework for personalized sticker retrieval, and make two key contributions: (i) To encode user-specific sticker preferences, we design a representation learning model to learn discriminative user representations. It is trained on three prediction tasks that leverage personal information and click history; and (ii) To generate stickers aligned with a user's query intent, we propose a novel intent-aware learning objective that prioritizes stickers associated with higher-ranked intents. Empirical results from both offline evaluations and online tests demonstrate that PEARL significantly outperforms state-of-the-art methods.
Chinese: 该摘要提出PEARL,一种新颖的个性化表情检索生成框架,通过表征学习和意图感知目标来学习用户特定偏好,解决了现有方法缺乏个性化的问题,在评估中显著优于最先进方法。
English: This abstract introduces PEARL, a novel generative framework for personalized sticker retrieval that addresses the lack of personalization in existing methods by learning user-specific preferences through representation learning and an intent-aware objective, significantly outperforming state-of-the-art approaches in evaluations.

Authors:Yunfei Zhong, Jun Yang, Yixing Fan, Jiafeng Guo, Lixin Su, Maarten de Rijke, Ruqing Zhang, Dawei Yin, Xueqi Cheng
Title: Reasoning-enhanced Query Understanding through Decomposition and Interpretation
Abstract:
Accurate inference of user intent is crucial for enhancing document retrieval in modern search engines. While large language models (LLMs) have made significant strides in this area, their effectiveness has predominantly been assessed with short, keyword-based queries. As AI-driven search evolves, long-form queries with intricate intents are becoming more prevalent, yet they remain underexplored in the context of LLM-based query understanding (QU). To bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. ReDI leverages the reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i) it breaks down complex queries into targeted sub-queries to accurately capture user intent; (ii) it enriches each sub-query with detailed semantic interpretations to improve the query-document matching; and (iii) it independently retrieves documents for each sub-query and employs a fusion strategy to aggregate the results for the final ranking. We compiled a large-scale dataset of real-world complex queries from a major search engine and distilled the query understanding capabilities of teacher models into smaller models for practical application. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms, affirming its effectiveness.
Chinese: ReDI提出了一种推理增强方法,利用大语言模型将复杂查询分解为子查询并丰富语义解释,通过融合检索结果在稀疏和稠密检索范式中均展现出优于现有方法的性能。
English: ReDI introduces a reasoning-enhanced approach using large language models to decompose complex queries into sub-queries, enrich them with interpretations, and fuse retrieval results, demonstrating superior performance over existing methods in both sparse and dense retrieval paradigms.

Authors:Yuyao Ge, Shenghua Liu, Yiwei Wang, Lingrui Mei, Baolong Bi, Xuanshan Zhou, Jiayu Yao, Jiafeng Guo, Xueqi Cheng
Title: Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
Abstract:
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity. (3) Theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signals and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
中文: 视觉语言模型在复杂视觉环境中表现不佳,而提出的CARVE方法通过对比注意力机制,无需额外训练即可提升性能,最高达75%,有效分离任务相关信号与视觉噪声。
English: Vision-Language Models (VLMs) struggle in complex visual environments, but the proposed CARVE method leverages attention contrast to enhance performance without additional training, achieving up to 75% improvement by isolating task-relevant signals from visual noise.

Authors:Guanjun Wu, Jiemin Fang, Chen Yang, Sikuang Li, Taoran Yi, Jia Lu, Zanwei Zhou, Jiazhong Cen, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Xinggang Wang, Qi Tian
Title: UniLat3D: Geometry-Appearance Unified Latents for Single-Stage 3D Generation
Abstract:
High-fidelity 3D asset generation is crucial for various industries. While recent 3D pretrained models show strong capability in producing realistic content, most are built upon diffusion models and follow a two-stage pipeline that first generates geometry and then synthesizes appearance. Such a decoupled design tends to produce geometry-texture misalignment and non-negligible cost. In this paper, we propose UniLat3D, a unified framework that encodes geometry and appearance in a single latent space, enabling direct single-stage generation. Our key contribution is a geometry-appearance Unified VAE, which compresses high-resolution sparse features into a compact latent representation -- UniLat. UniLat integrates structural and visual information into a dense low-resolution latent, which can be efficiently decoded into diverse 3D formats, e.g., 3D Gaussians and meshes. Based on this unified representation, we train a single flow-matching model to map Gaussian noise directly into UniLat, eliminating redundant stages. Trained solely on public datasets, UniLat3D produces high-quality 3D assets in seconds from a single image, achieving superior appearance fidelity and geometric quality. More demos \& code are available at https://unilat3d.github.io/
中文: UniLat3D提出了一种统一单阶段框架,将几何与外观整合到单一潜在空间中,实现了从图像快速生成高质量三维资产的高效方法。
English: UniLat3D introduces a unified single-stage framework that integrates geometry and appearance into a single latent space, enabling efficient high-quality 3D asset generation from images in seconds.

Authors:Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu
Title: SCUBA: Salesforce Computer Use Benchmark
Abstract:
We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas, platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including Enterprise Software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments with support for parallel execution and fine-grained evaluation metrics to capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observed huge performance gaps in different agent design paradigms and gaps between the open-source model and the closed-source model. In the zero-shot setting, open-source model powered computer-use agents that have strong performance on related benchmarks like OSWorld only have less than 5\% success rate on SCUBA, while methods built on closed-source models can still have up to 39% task success rate. In the demonstration-augmented settings, task success rates can be improved to 50\% while simultaneously reducing time and costs by 13% and 16%, respectively. These findings highlight both the challenges of enterprise tasks automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.
中文: SCUBA是一个用于评估Salesforce平台客户关系管理工作流程中计算机使用代理的基准,包含300个真实任务,揭示了开源与闭源模型间的显著性能差距,示范增强方法可将任务成功率提升至50%。
English: SCUBA is a benchmark for evaluating computer-use agents on Salesforce CRM workflows, featuring 300 realistic tasks that reveal significant performance gaps between open-source and closed-source models, with demonstration-augmented methods achieving up to 50% success rates.

Authors:Wanli Yang, Fei Sun, Rui Tang, Hongyu Zang, Du Su, Qi Cao, Jingang Wang, Huawei Shen, Xueqi Cheng
Title: Fine-tuning Done Right in Model Editing
Abstract:
Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing. Here, we challenge this belief, arguing that the reported failure arises not from the inherent limitation of fine-tuning itself, but from adapting it to the sequential nature of the editing task, a single-pass depth-first pipeline that optimizes each sample to convergence before moving on. While intuitive, this depth-first pipeline coupled with sample-wise updating over-optimizes each edit and induces interference across edits. Our controlled experiments reveal that simply restoring fine-tuning to the standard breadth-first (i.e., epoch-based) pipeline with mini-batch optimization substantially improves its effectiveness for model editing. Moreover, fine-tuning in editing also suffers from suboptimal tuning parameter locations inherited from prior methods. Through systematic analysis of tuning locations, we derive LocFT-BF, a simple and effective localized editing method built on the restored fine-tuning framework. Extensive experiments across diverse LLMs and datasets demonstrate that LocFT-BF outperforms state-of-the-art methods by large margins. Notably, to our knowledge, it is the first to sustain 100K edits and 72B-parameter models,10 x beyond prior practice, without sacrificing general capabilities. By clarifying a long-standing misconception and introducing a principled localized tuning strategy, we advance fine-tuning from an underestimated baseline to a leading method for model editing, establishing a solid foundation for future research.
中文摘要:传统认为微调不适用于模型编辑,但研究表明,通过恢复为广度优先的小批量优化流程并结合局部参数调优,微调不仅效果显著提升,还能支持大规模编辑和模型参数,成为领先的编辑方法。
English Summary: Fine-tuning, traditionally deemed ineffective for model editing, is shown to be highly effective when restored to a breadth-first pipeline with mini-batch optimization and localized parameter tuning, outperforming state-of-the-art methods and scaling to unprecedented edit volumes and model sizes.

Authors:Cheng Qian, Zuxin Liu, Akshara Prabhakar, Jielin Qiu, Zhiwei Liu, Haolin Chen, Shirley Kokane, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
Title: UserRL: Training Interactive User-Centric Agent via Reinforcement Learning
Abstract:
Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet, the ultimate value of such agents lies in their ability to assist users, a setting where diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitates training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user simulation choice is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All codes and data are public for future research.
中文: UserRL提出了一个通过标准化环境与模拟用户训练和评估以用户为中心的智能体模型的统一框架,研究表明奖励机制设计和用户模拟选择与模型规模同等重要,共同决定了多轮交互能力的稳健发展。
English: UserRL introduces a unified framework for training and evaluating user-centric agentic models through standardized gym environments with simulated users, demonstrating that reward shaping and user simulation choices are as vital as model scale for developing robust multi-turn interaction abilities.

Authors:Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao
Title: Online Process Reward Leanring for Agentic Reinforcement Learning
Abstract:
Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high-variance from overly fine-grained signals or failtures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent's policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy update, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverfiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and lower variance during training. Further analysis also demonstrates the efficient exploration by OPRL using fewer actions, underscoring its potential for agentic learning in real-world scenarios.
中文摘要:本文提出iStar方法,通过隐式步骤奖励解决大型语言模型智能体训练中的稀疏奖励难题,在多个基准测试中实现最优性能,同时提升样本效率和训练稳定性。
English Summary: The paper introduces iStar, a novel credit-assignment strategy for training LLM agents that uses implicit step rewards to overcome sparse reward challenges, achieving state-of-the-art performance across multiple benchmarks with improved efficiency and stability.

Authors:Xiaoqian Liu, Ke Wang, Yuchuan Wu, Fei Huang, Yongbin Li, Junge Zhang, Jianbin Jiao
Title: Agentic Reinforcement Learning with Implicit Step Rewards
Abstract:
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL) that reason and act in interactive environments. However, sparse and sometimes unverifiable rewards make it extremely challenging to assign credit when training LLM agents that serve as a policy. Recent work attempts to integrate process supervision into RL but suffers from biased annotation, reward hacking, high-variance from overly fine-grained rewards or failtures when state overlap is rare. We therefore introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms without relying on additional rollouts or explicit step labels. Particularly, we alternatively optimize an implicit process reward model (PRM) with the policy model to generate implicit step rewards via a trajectory-based DPO objective. Theoretical analysis shows that this learning objective produces a step-wise reward function. Then the implicit step rewards are used to compute step-level advantages, which are combined with trajectory (or episode)-level advantages for policy updates, creating a self-reinforcing training loop. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA. Crucially, iStar shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and training stability. Further analysis also demonstrates efficient exploration by iStar with increased rewards in both step- and episode-level while maintaining fewer steps to achieve task success. Code will be available soon.
中文摘要:本文提出iStar方法,通过隐式步骤奖励解决大型语言模型智能体训练中的稀疏奖励难题,在多个基准测试中实现最优性能,同时提升样本效率和训练稳定性。
English Summary: The paper introduces iStar, a novel credit-assignment strategy for training LLM agents that uses implicit step rewards to overcome sparse reward challenges, achieving state-of-the-art performance across multiple benchmarks with improved efficiency and stability.

Authors:Zongyue Xue, Siyuan Zheng, Shaochun Wang, Yiran Hu, Shenran Wang, Yuxin Yao, Haitao Li, Qingyao Ai, Yiqun Liu, Yun Liu, Weixing Shen
Title: JustEva: A Toolkit to Evaluate LLM Fairness in Legal Knowledge Inference
Abstract:
The integration of Large Language Models (LLMs) into legal practice raises pressing concerns about judicial fairness, particularly due to the nature of their "black-box" processes. This study introduces JustEva, a comprehensive, open-source evaluation toolkit designed to measure LLM fairness in legal tasks. JustEva features several advantages: (1) a structured label system covering 65 extra-legal factors; (2) three core fairness metrics - inconsistency, bias, and imbalanced inaccuracy; (3) robust statistical inference methods; and (4) informative visualizations. The toolkit supports two types of experiments, enabling a complete evaluation workflow: (1) generating structured outputs from LLMs using a provided dataset, and (2) conducting statistical analysis and inference on LLMs' outputs through regression and other statistical methods. Empirical application of JustEva reveals significant fairness deficiencies in current LLMs, highlighting the lack of fair and trustworthy LLM legal tools. JustEva offers a convenient tool and methodological foundation for evaluating and improving algorithmic fairness in the legal domain.
Chinese: 本研究推出JustEva开源工具包,通过评估不一致性、偏见和不平衡误差三大核心指标来检测大语言模型在法律任务中的公平性,实证研究发现现有模型存在明显缺陷。
English: This study introduces JustEva, an open-source toolkit designed to evaluate the fairness of Large Language Models in legal tasks by measuring inconsistency, bias, and imbalanced inaccuracy, revealing significant fairness issues in current models.

Authors:Junjie Chen, Haitao Li, Minghao Qin, Yujia Zhou, Yanxue Ren, Wuyue Wang, Yiqun Liu, Yueyue Wu, Qingyao Ai
Title: Simulating Dispute Mediation with LLM-Based Agents for Legal Research
Abstract:
Legal dispute mediation plays a crucial role in resolving civil disputes, yet its empirical study is limited by privacy constraints and complex multivariate interactions. To address this limitation, we present AgentMediation, the first LLM-based agent framework for simulating dispute mediation. It simulates realistic mediation processes grounded in real-world disputes and enables controlled experimentation on key variables such as disputant strategies, dispute causes, and mediator expertise. Our empirical analysis reveals patterns consistent with sociological theories, including Group Polarization and Surface-level Consensus. As a comprehensive and extensible platform, AgentMediation paves the way for deeper integration of social science and AI in legal research.
Chinese: AgentMediation是首个基于大语言模型的调解代理框架,通过模拟真实法律纠纷调解过程,实现对关键变量的可控实验,并揭示了与群体极化和表面共识等社会学理论一致的模式。
English: AgentMediation is the first LLM-based agent framework that simulates realistic legal dispute mediation processes, enabling controlled experiments on key variables and revealing patterns consistent with sociological theories like Group Polarization.

Authors:Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty
Title: SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
Abstract:
Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (``thinking'') models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.
中文摘要:本文通过持续强化学习与合成数据,开发了用于深度研究的自主单智能体模型,在保持推理能力的同时显著提升了智能体技能,并在基准测试中取得了优异表现。
English Summary: This paper develops autonomous single-agent models for deep research using continual reinforcement learning with synthetic data, achieving significant performance on benchmarks while enhancing agentic skills without compromising reasoning abilities.

Authors:Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
Title: Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Abstract:
Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.
中文摘要:Strefer是一种合成数据生成框架,通过创建多样化的指令调优数据来增强视频大语言模型的时空推理能力,使其无需昂贵的人工标注即可更准确地解析视频中的空间和时间参照。
English Summary: Strefer is a synthetic data generation framework that enhances Video LLMs' spatiotemporal reasoning by creating diverse instruction-tuning data, enabling them to better interpret spatial and temporal references in videos without costly human annotation.

Authors:Minghuan Liu, Zhengbang Zhu, Xiaoshen Han, Peng Hu, Haotong Lin, Xinyao Li, Jingxiao Chen, Jiafeng Xu, Yichu Yang, Yunfeng Lin, Xinghang Li, Yong Yu, Weinan Zhang, Tao Kong, Bingyi Kang
Title: Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots
Abstract:
Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties-such as distance, size, and shape-than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin on daily-use depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without the need for adding noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research in utilizing simulation data and 3D information in general robot policies.
中文摘要:本研究提出相机深度模型(CDMs),通过将含噪深度相机数据转化为精确三维几何信息,使仅基于模拟深度训练的策略能无缝迁移至现实机器人操作任务,并在处理复杂物体时保持性能稳定。
English Summary: This study introduces Camera Depth Models (CDMs) to enhance robotic manipulation by converting noisy depth camera data into accurate 3D geometric information, enabling policies trained solely on simulated depth to generalize effectively to real-world tasks without performance loss.

Authors:Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li, Ranjie Duan, Qiang Liu, Hang Su, Yinpeng Dong, Jun Zhu
Title: Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Abstract:
Although Large Reasoning Models (LRMs) have progressed in solving complex problems, their chain-of-thought (CoT) reasoning often contains harmful content that can persist even when the final responses appear safe. We show that this issue still remains in existing methods which overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible for and exploited by malicious users. We therefore shift our focus to aligning the safety of reasoning itself in this paper and explore process supervision as the solution. However, simply rewarding safe reasoning proves inadequate due to low rollout diversity and limited training signals. To tackle this challenge, we first delve into the characteristics of safe reasoning and uncover several critical insights that 1) safe reasoning is often consolidated by a few critical steps of safety triggers; 2) compliance cues strongly correlate with unsafe continuations; and 3) corrective interventions reliably steer unsafe trajectories towards safer traces. Motivated by these, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals. Experiments on jailbreak and adversarial safety benchmarks demonstrate that IPO remarkably improves overall safety regarding both reasoning and responses, outperforming SFT-based and RL-based baselines with a relative reduction of over 30% in harmfulness, while preserving excellent performance across diverse reasoning tasks. The results highlight the importance of explicit alignment for reasoning and provide a practical path to safer LRMs.
Large Reasoning Models often generate unsafe chain-of-thought reasoning despite appearing safe in final responses, so this paper proposes Intervened Preference Optimization (IPO) to align reasoning safety by replacing compliance steps with safety triggers, achieving over 30% harm reduction while maintaining reasoning performance.
English Summary:

Authors:Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
Title: DiffusionNFT: Online Diffusion Reinforcement with Forward Process
Abstract:
Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.
中文摘要:DiffusionNFT提出了一种通过流匹配直接优化扩散模型的高效在线强化学习范式,无需依赖求解器限制和分类器自由引导,能以更高效率实现更优性能。
English Summary: DiffusionNFT introduces an efficient online RL paradigm that optimizes diffusion models directly via flow matching, eliminating solver restrictions and CFG dependency while achieving superior performance with greater efficiency.

Authors:Yan Sun, Yinqiu Liu, Shaoyong Guo, Ruichen Zhang, Jiacheng Wang, Xuesong Qiu, Geng Sun, Weifeng Gong, Dusit Niyato, Qihui Wu
Title: A Synergy of Computing Power Networks and Low-Altitude Economy Intelligent Communications: Challenges, Design Principles, and Research Directions
Abstract:
The rapid development of the Low-Altitude Economy (LAE) has created opportunities for emerging services such as autonomous aerial transportation, aerial sensing, and emergency response, all of which rely on efficient and intelligent communications. However, LAE intelligent communications face several challenges, including the limited computational capacity of aerial nodes, the lack of cross-scenario generalization, and the complexity of heterogeneous demands. Meanwhile, Computing Power Networks (CPNs) have emerged as a new paradigm for integrating distributed computing, networking, and storage resources, but they are also constrained by static deployment and limited adaptability. In this survey, we explore the synergy between LAE intelligent communications and CPNs. We first analyze how CPNs can support LAE intelligent communications in areas such as air-ground collaborative control, AI training, communication-computation co-ptimization, and ubiquitous low-altitude information processing. Conversely, we discuss how LAE intelligent communications can enhance CPNs through mobility-assisted control, distributed intelligent training, dynamic routing, and in-network aerial computing. Finally, based on these insights, we outline design principles and future research directions for integrated CPN-LAE systems. This work provides a comprehensive foundation for building flexible, adaptive, and resilient architectures that leverage the synergy between CPNs and LAE to deliver high-quality and sustainable low-altitude services.
中文摘要:本综述探讨低空经济智能通信与算力网络的协同融合,通过空地协同控制与动态资源优化等领域的相互增强解决现有挑战,为构建可持续低空服务的自适应架构奠定基础。
English Summary: This survey explores the synergistic integration of Low-Altitude Economy intelligent communications and Computing Power Networks, addressing challenges through mutual enhancement in areas like collaborative control and dynamic resource optimization to build adaptive architectures for sustainable services.

Authors:Wenwen Xie, Geng Sun, Jiahui Li, Jiacheng Wang, Yinqiu Liu, Dusit Niyato, Dong In Kim, Shiwen Mao
Title: RIS-assisted Data Collection and Wireless Power Transfer in Low-altitude Wireless Networks
Abstract:
Low-altitude wireless networks (LAWNs) have become effective solutions for collecting data from low-power Internet-of-Things devices (IoTDs) in remote areas with limited communication infrastructure. However, some outdoor IoTDs deployed in such areas face both energy constraints and low-channel quality challenges, making it challenging to ensure timely data collection from these IoTDs in LAWNs. In this work, we investigate a reconfigurable intelligent surface (RIS)-assisted uncrewed aerial vehicle (UAV)-enabled data collection and wireless power transfer system in LAWN. Specifically, IoTDs first harvest energy from a low-altitude UAV, and then upload their data to the UAV by applying the time division multiple access (TDMA) protocol, supported by an RIS to improve the channel quality. To maintain satisfactory data freshness of the IoTDs and save energy for an energy-constrained UAV, we aim to minimize the age of information (AoI) and energy consumption of the UAV by jointly optimizing the RIS phase shits, UAV trajectory, charging time allocation, and binary IoTD scheduling. We propose a deep reinforcement learning (DRL)-based approach, namely the alternating optimization-improved parameterized deep Q-network (AO-IPDQN). Specifically, considering that RIS typically contains a large number of reflecting elements, we first adopt an alternating optimization (AO) method to optimize the RIS phase shifts to reduce the dimension of the action space. Then, we propose the improved parameterized deep Q-network (IPDQN) method to deal with the hybrid action space. Simulation results indicate that AO-IPDQN approach achieves excellent performance relative to multiple comparison methods across various simulation scenarios.
Chinese: 本研究提出一种名为AO-IPDQN的深度强化学习方法,通过联合优化RIS相位偏移、无人机轨迹和资源分配,在辅助无人机系统中实现远程物联网设备数据采集的新鲜度和能效优化。
English: This study introduces a deep reinforcement learning approach called AO-IPDQN to optimize data freshness and energy efficiency in RIS-assisted UAV systems for collecting data from energy-constrained IoT devices in remote areas with limited infrastructure.

Authors:Ping Zhang, Xiaodong Xu, Mengying Sun, Haixiao Gao, Nan Ma, Xiaoyun Wang, Ruichen Zhang, Jiacheng Wang, Dusit Niyato
Title: Towards Native AI in 6G Standardization: The Roadmap of Semantic Communication
Abstract:
Semantic communication (SemCom) has emerged as a transformative paradigm for future 6G networks, offering task-oriented and meaning-aware transmission that fundamentally redefines traditional bit-centric design. Recognized by leading standardization bodies including the institute of electrical and electronics engineers (IEEE) and the international telecommunication union (ITU), and actively discussed within the 3rd generation partnership project (3GPP) working groups, SemCom is rapidly gaining traction as a foundational enabler for native-AI 6G. This paper presents a comprehensive overview of recent progress in SemCom from both academic and industrial perspectives, with a focus on its ongoing and upcoming standardization activities. We systematically examine advances in representative application scenarios, architectural design, semantic-traditional system compatibility, unified evaluation metrics, and validation methodologies. Furthermore, we highlight several key enabling technologies, such as joint source-channel coding (JSCC), SemCom-based multiple access (MA) technologies such as model division MA (MDMA), and semantic knowledge base (KB), that support the practical implementation of SemCom in standard-compliant systems. Additionally, we present a case study for channel state information (CSI) feedback, illustrating the concrete performance gains of SemCom under 3GPP-compliant fading channels. Finally, we discuss emerging challenges and research opportunities for incorporating semantic-native mechanisms into the evolving 6G standardization landscape, and provide forward-looking insights into its development and global adoption.
中文: 语义通信正成为6G的变革性范式,从以比特为中心转向面向任务的传输,本文全面综述了其标准化进展、关键技术和实际应用,并探讨了未来挑战。
English: Semantic communication is emerging as a transformative 6G paradigm that shifts from bit-centric to task-oriented transmission, with this paper providing a comprehensive overview of its standardization progress, key technologies, and practical applications while addressing future challenges.

Authors:Zifan Lang, Guixia Liu, Geng Sun, Jiahui Li, Jiacheng Wang, Weijie Yuan, Dusit Niyato, Dong In Kim
Title: Joint AoI and Handover Optimization in Space-Air-Ground Integrated Network
Abstract:
Despite the widespread deployment of terrestrial networks, providing reliable communication services to remote areas and maintaining connectivity during emergencies remains challenging. Low Earth orbit (LEO) satellite constellations offer promising solutions with their global coverage capabilities and reduced latency, yet struggle with intermittent coverage and limited communication windows due to orbital dynamics. This paper introduces an age of information (AoI)-aware space-air-ground integrated network (SAGIN) architecture that leverages a high-altitude platform (HAP) as intelligent relay between the LEO satellites and ground terminals. Our three-layer design employs hybrid free-space optical (FSO) links for high-capacity satellite-to-HAP communication and reliable radio frequency (RF) links for HAP-to-ground transmission, and thus addressing the temporal discontinuity in LEO satellite coverage while serving diverse user priorities. Specifically, we formulate a joint optimization problem to simultaneously minimize the AoI and satellite handover frequency through optimal transmit power distribution and satellite selection decisions. This highly dynamic, non-convex problem with time-coupled constraints presents significant computational challenges for traditional approaches. To address these difficulties, we propose a novel diffusion model (DM)-enhanced dueling double deep Q-network with action decomposition and state transformer encoder (DD3QN-AS) algorithm that incorporates transformer-based temporal feature extraction and employs a DM-based latent prompt generative module to refine state-action representations through conditional denoising. Simulation results highlight the superior performance of the proposed approach compared with policy-based methods and some other deep reinforcement learning (DRL) benchmarks.
中文摘要:本文提出了一种利用高空平台作为中继的信息年龄感知空天地一体化网络架构,通过新型深度强化学习算法同时优化信息新鲜度和卫星切换频率,有效解决低轨卫星通信的间歇性覆盖问题。
English Summary: This paper proposes an age of information-aware space-air-ground integrated network using a high-altitude platform as a relay to address intermittent LEO satellite coverage, introducing a novel deep reinforcement learning algorithm that optimizes both information freshness and satellite handover frequency.

Authors:Haoxiang Luo, Yu Yan, Yanhui Bian, Wenjiao Feng, Ruichen Zhang, Yinqiu Liu, Jiacheng Wang, Gang Sun, Dusit Niyato, Hongfang Yu, Abbas Jamalipour, Shiwen Mao
Title: AI Reasoning for Wireless Communications and Networking: A Survey and Perspectives
Abstract:
Artificial Intelligence (AI) techniques play a pivotal role in optimizing wireless communication networks. However, traditional deep learning approaches often act as closed boxes, lacking the structured reasoning abilities needed to tackle complex, multi-step decision problems. This survey provides a comprehensive review and outlook of reasoning-enabled AI in wireless communication networks, with a focus on Large Language Models (LLMs) and other advanced reasoning paradigms. In particular, LLM-based agents can combine reasoning with long-term planning, memory, tool utilization, and autonomous cross-layer control to dynamically optimize network operations with minimal human intervention. We begin by outlining the evolution of intelligent wireless networking and the limitations of conventional AI methods. We then introduce emerging AI reasoning techniques. Furthermore, we establish a classification system applicable to wireless network tasks. We also present a layer-by-layer examination for AI reasoning, covering the physical, data link, network, transport, and application layers. For each part, we identify key challenges and illustrate how AI reasoning methods can improve AI-based wireless communication performance. Finally, we discuss key research directions for AI reasoning toward future wireless communication networks. By combining insights from both communications and AI, this survey aims to chart a path for integrating reasoning techniques into the next-generation wireless networks.
中文: 本综述探讨了以大型语言模型为代表的推理赋能人工智能在无线网络优化中的应用,通过结构化推理、跨层级分析和自主控制突破传统深度学习的局限,旨在以最少人工干预提升网络性能。
English: This survey explores reasoning-enabled AI, particularly Large Language Models, for optimizing wireless networks by overcoming the limitations of traditional deep learning through structured reasoning, multi-layer analysis, and autonomous control to enhance performance with minimal human intervention.

Authors:Zifan Lang, Guixia Liu, Jiahui Li, Geng Sun, Zemin Sun, Jiacheng Wang, Dusit Niyato, Victor C. M. Leung
Title: Multi-AAV-enabled Distributed Beamforming in Low-Altitude Wireless Networking for AoI-Sensitive IoT Data Forwarding
Abstract:
With the rapid development of low-altitude wireless networking, autonomous aerial vehicles (AAVs) have emerged as critical enablers for timely and reliable data delivery, particularly in remote or underserved areas. In this context, the age of information (AoI) has emerged as a critical performance metric for evaluating the freshness and timeliness of transmitted information in Internet of things (IoT) networks. However, conventional AAV-assisted data transmission is fundamentally limited by finite communication coverage ranges, which requires periodic return flights for data relay operations. This propulsion-repositioning cycle inevitably introduces latency spikes that raise the AoI while degrading service reliability. To address these challenges, this paper proposes a AAV-assisted forwarding system based on distributed beamforming to enhance the AoI in IoT. Specifically, AAVs collaborate via distributed beamforming to collect and relay data between the sensor nodes and remote base station. Then, we formulate an optimization problem to minimize the AoI and AAV energy consumption, by jointly optimizing the AAV trajectories and communication schedules. Due to the non-convex nature of the problem and its pronounced temporal variability, we introduce a deep reinforcement learning solution that incorporates temporal sequence input, layer normalization gated recurrent unit, and a squeeze-and-excitation block to capture long-term dependencies, thereby improving decision-making stability and accuracy, and reducing computational complexity. Simulation results demonstrate that the proposed SAC-TLS algorithm outperforms baseline algorithms in terms of convergence, time average AoI, and energy consumption of AAVs.
中文: 本文提出了一种基于分布式波束成形的自主飞行器辅助转发系统,通过深度强化学习优化飞行轨迹和通信调度,以降低物联网网络中的信息年龄和能耗。
English: This paper proposes an AAV-assisted forwarding system using distributed beamforming to enhance information freshness in IoT networks, employing a deep reinforcement learning approach to optimize trajectories and communication schedules for reduced latency and energy consumption.

Authors:Hang Yin, Haoyu Wei, Xiuwei Xu, Wenxuan Guo, Jie Zhou, Jiwen Lu
Title: GC-VLN: Instruction as Graph Constraints for Training-free Vision-and-Language Navigation
Abstract:
In this paper, we propose a training-free framework for vision-and-language navigation (VLN). Existing zero-shot VLN methods are mainly designed for discrete environments or involve unsupervised training in continuous simulator environments, which makes it challenging to generalize and deploy them in real-world scenarios. To achieve a training-free framework in continuous environments, our framework formulates navigation guidance as graph constraint optimization by decomposing instructions into explicit spatial constraints. The constraint-driven paradigm decodes spatial semantics through constraint solving, enabling zero-shot adaptation to unseen environments. Specifically, we construct a spatial constraint library covering all types of spatial relationship mentioned in VLN instructions. The human instruction is decomposed into a directed acyclic graph, with waypoint nodes, object nodes and edges, which are used as queries to retrieve the library to build the graph constraints. The graph constraint optimization is solved by the constraint solver to determine the positions of waypoints, obtaining the robot's navigation path and final goal. To handle cases of no solution or multiple solutions, we construct a navigation tree and the backtracking mechanism. Extensive experiments on standard benchmarks demonstrate significant improvements in success rate and navigation efficiency compared to state-of-the-art zero-shot VLN methods. We further conduct real-world experiments to show that our framework can effectively generalize to new environments and instruction sets, paving the way for a more robust and autonomous navigation framework.
中文: 本文提出了一种免训练的视觉语言导航框架,通过将导航指令转化为图约束优化问题,在模拟和真实环境中均实现了卓越的零样本导航性能。
English: This paper introduces a training-free framework for vision-and-language navigation that formulates navigation as graph constraint optimization, achieving superior zero-shot performance in both simulated and real-world environments.

Authors:Ziheng Li, Zexu Sun, Jinman Zhao, Erxue Min, Yongcheng Zeng, Hui Wu, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen, Zhi-Hong Deng
Title: Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. LLMs fail to discover viable reasoning paths when problems are overly difficult, while learning little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
Chinese: SEELE框架通过自适应调整提示长度来动态匹配问题难度与模型能力,显著提升了强化学习的探索效率,在多项数学推理基准测试中超越了现有最佳方法。
English: The SEELE framework enhances reinforcement learning with verifiable rewards by dynamically adjusting problem difficulty through adaptive hint lengths, significantly improving exploration efficiency and outperforming previous methods across multiple math reasoning benchmarks.

Authors:Jue Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Title: From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models
Abstract:
Large Reasoning Models (LRMs) generate explicit reasoning traces alongside final answers, yet the extent to which these traces influence answer generation remains unclear. In this work, we conduct a three-stage investigation into the interplay between reasoning and answer generation in three distilled DeepSeek R1 models. First, through empirical evaluation, we demonstrate that including explicit reasoning consistently improves answer quality across diverse domains. Second, attention analysis reveals that answer tokens attend substantially to reasoning tokens, with certain mid-layer Reasoning-Focus Heads (RFHs) closely tracking the reasoning trajectory, including self-reflective cues. Third, we apply mechanistic interventions using activation patching to assess the dependence of answer tokens on reasoning activations. Our results show that perturbations to key reasoning tokens can reliably alter the final answers, confirming a directional and functional flow of information from reasoning to answer. These findings deepen our understanding of how LRMs leverage reasoning tokens for answer generation, highlighting the functional role of intermediate reasoning in shaping model outputs. Our data and code are publicly available at \href{https://aka.ms/R2A-code}{this URL}.
中文摘要:大型推理模型通过生成显式推理轨迹提升答案质量,机制干预实验证实答案标记在功能上依赖于推理激活来最终决定输出结果。
English Summary: Large Reasoning Models improve answer quality by generating explicit reasoning traces, with mechanistic interventions confirming that answer tokens functionally depend on reasoning activations to shape final outputs.

Authors:Yuhang Xie, Jian Mu, Xiaojun Ma, Chaoyun Zhang, Lu Wang, Mengyu Zhou, Mugeng Liu, Si Qin, Qingwei Lin, Saravan Rajmohan, Shi Han, Dongmei Zhang
Title: No More Manual Guides: Automatic and Scalable Generation of High-Quality Excel Tutorials
Abstract:
Excel is one of the most widely used productivity tools across domains, offering rich functionality but also overwhelming users with its complexity. This creates a persistent demand for tutorials to support effective usage. However, existing tutorials are manually authored by experts, require frequent updates after each software release, and incur substantial labor costs. Prior work has not achieved fully automated tutorial generation, since existing methods still depend on handcrafted operation sequences or example materials. In this paper, we present the first framework for automatically generating Excel tutorials directly from natural language task descriptions. Our framework first instantiates the task. Then a central component of this framework, Execution Agent, plans and executes the solution in Excel, and collects the intermediate artifacts required for tutorial construction. These artifacts are then transformed into both structured Excel documents and video demonstrations. To build a comprehensive tutorial corpus, we collected 1,559 task descriptions from real-world scenarios. In addition, we designed a systematic evaluation framework that integrates assessments from both large language models (LLMs) and human reviewers. Experimental results show that our framework improves task execution success rates by 8.5% over state-of-the-art baselines. Moreover, the generated tutorials demonstrate superior readability and instructional effectiveness, often approaching or surpassing expert-authored materials. Importantly, the automated pipeline eliminates manual labor and reduces time costs to 1/20 of expert authoring, making scalable and high-quality tutorial generation practical for the first time.
Chinese: 本文提出了首个从自然语言描述自动生成Excel教程的框架,将任务执行成功率提升8.5%,制作时间缩短至人工的1/20,同时生成质量达到甚至超越专家编写水平。
English: This paper introduces the first automated framework for generating Excel tutorials from natural language descriptions, which enhances task success rates by 8.5% and reduces production time to 1/20 of manual efforts while maintaining quality comparable to expert-authored materials.

Authors:Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
Title: Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
Abstract:
The increasing autonomy of LLM agents in handling sensitive communications, accelerated by Model Context Protocol (MCP) and Agent-to-Agent (A2A) frameworks, creates urgent privacy challenges. While recent work reveals significant gaps between LLMs' privacy Q&A performance and their agent behavior, existing benchmarks remain limited to static, simplified scenarios. We present PrivacyChecker, a model-agnostic, contextual integrity based mitigation approach that effectively reduces privacy leakage from 36.08% to 7.30% on DeepSeek-R1 and from 33.06% to 8.32% on GPT-4o, all while preserving task helpfulness. We also introduce PrivacyLens-Live, transforming static benchmarks into dynamic MCP and A2A environments that reveal substantially higher privacy risks in practical. Our modular mitigation approach integrates seamlessly into agent protocols through three deployment strategies, providing practical privacy protection for the emerging agentic ecosystem. Our data and code will be made available at https://aka.ms/privacy_in_action.
中文: PrivacyChecker将LLM代理的隐私泄露率从30%以上有效降至9%以下且不影响任务性能,而PrivacyLens-Live通过将静态基准转化为动态环境,揭示了实际应用中更高的隐私风险。
English: PrivacyChecker effectively reduces privacy leakage in LLM agents from over 30% to under 9% while maintaining task performance, and PrivacyLens-Live transforms static benchmarks into dynamic environments to reveal higher real-world privacy risks.

Authors:Qibin Wang, Pu Zhao, Shaohan Huang, Fangkai Yang, Lu Wang, Furu Wei, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Title: Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
Abstract:
To further enhance the ability of Large Language Models (LLMs) to solve complex, multi-step reasoning problems, test-time scaling (TTS) methods have gained widespread attention. Existing approaches such as Best-of-N and majority voting are limited as their performance depends on the quality of candidate responses, making them unable to produce a correct solution when all candidates are incorrect. Introducing an additional model to select the best response also incurs significant deployment costs. To this end, we introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework where a unified model first generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution based on a prompt consisting of the problem and these candidates. However, LLMs struggle to perform refinement effectively when prompted directly. Therefore, we design a hybrid training pipeline by jointly optimizing for two complementary objectives, solving problems directly and refining candidate responses. Experimental results demonstrate that our method achieves state-of-the-art performance across five mathematical benchmarks. We further show that this learned self-refinement skill is a model-agnostic enhancement, robust across different model scales and generalizing to out-of-distribution reasoning tasks.
中文摘要:针对现有测试时扩展方法的局限,我们提出生成式自我优化框架,通过统一模型并行生成候选答案并基于提示进行自我优化合成更优解,在多个数学基准测试中取得最优性能,且该优化能力具备模型无关性和任务泛化性。
English Summary: To overcome the limitations of existing test-time scaling methods, we propose Generative Self-Refinement (GSR), a parallel framework where a unified model generates candidate responses and synthesizes a superior solution through self-refinement, achieving state-of-the-art performance across mathematical benchmarks.

Authors:Weilin Zhao, Zihan Zhou, Zhou Su, Chaojun Xiao, Yuxuan Li, Yanghao Li, Yudi Zhang, Weilun Zhao, Zhen Li, Yuxiang Huang, Ao Sun, Xu Han, Zhiyuan Liu
Title: InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation
Abstract:
Long-sequence processing is a critical capability for modern large language models. However, the self-attention mechanism in the standard Transformer architecture faces severe computational and memory bottlenecks when processing long sequences. While trainable sparse attention methods offer a promising solution, existing approaches such as NSA introduce excessive extra parameters and disrupt the conventional \textit{pretrain-on-short, finetune-on-long} workflow, resulting in slow convergence and difficulty in acceleration. To overcome these limitations, we introduce dense-sparse switchable attention framework, termed as InfLLM-V2. InfLLM-V2 is a trainable sparse attention that seamlessly adapts models from short to long sequences. Specifically, InfLLM-V2 reuses dense attention parameters through parameter-free architecture modification, maintaining consistency between short and long sequence processing. Additionally, InfLLM-V2 ensures computational efficiency across all sequence lengths, by using dense attention for short inputs and smoothly transitioning to sparse attention for long sequences. To achieve practical acceleration, we further introduce an efficient implementation of InfLLM-V2 that significantly reduces the computational overhead. Our experiments on long-context understanding and chain-of-thought reasoning demonstrate that InfLLM-V2 is 4$\times$ faster than dense attention while retaining 98.1% and 99.7% of the performance, respectively. Based on the InfLLM-V2 framework, we have trained and open-sourced MiniCPM4.1 (https://huggingface.co/openbmb/MiniCPM4.1-8B), a hybrid reasoning model, providing a reproducible implementation for the research community.
中文: InfLLM-V2提出了一种稠密-稀疏可切换注意力框架,通过复用稠密注意力参数并针对长序列平滑切换至稀疏注意力,实现了高效的长序列处理,在保持高性能的同时提速高达4倍。
English: InfLLM-V2 introduces a dense-sparse switchable attention framework that enables efficient long-sequence processing by reusing dense attention parameters and transitioning to sparse attention for longer inputs, achieving up to 4x speedup while maintaining high performance.

Authors:Xin Wang, Jie Li, Zejia Weng, Yixu Wang, Yifeng Gao, Tianyu Pang, Chao Du, Yan Teng, Yingchun Wang, Zuxuan Wu, Xingjun Ma, Yu-Gang Jiang
Title: FreezeVLA: Action-Freezing Attacks against Vision-Language-Action Models
Abstract:
Vision-Language-Action (VLA) models are driving rapid progress in robotics by enabling agents to interpret multimodal inputs and execute complex, long-horizon tasks. However, their safety and robustness against adversarial attacks remain largely underexplored. In this work, we identify and formalize a critical adversarial vulnerability in which adversarial images can "freeze" VLA models and cause them to ignore subsequent instructions. This threat effectively disconnects the robot's digital mind from its physical actions, potentially inducing inaction during critical interventions. To systematically study this vulnerability, we propose FreezeVLA, a novel attack framework that generates and evaluates action-freezing attacks via min-max bi-level optimization. Experiments on three state-of-the-art VLA models and four robotic benchmarks show that FreezeVLA attains an average attack success rate of 76.2%, significantly outperforming existing methods. Moreover, adversarial images generated by FreezeVLA exhibit strong transferability, with a single image reliably inducing paralysis across diverse language prompts. Our findings expose a critical safety risk in VLA models and highlight the urgent need for robust defense mechanisms.
中文: 视觉-语言-动作模型存在严重安全漏洞,对抗性图像可使其"冻结"并忽略后续指令,FreezeVLA攻击框架在多个基准测试中达到76.2%的平均成功率,揭示了该模型在物理干预中的潜在瘫痪风险。
English: Vision-Language-Action models face a critical security vulnerability where adversarial images can freeze them into ignoring instructions, as demonstrated by the FreezeVLA attack achieving 76.2% success rate across multiple benchmarks.

Authors:Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
Title: Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue
Abstract:
The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components, one VLM for collaboration and one diffusion for action. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.
中文: Ask-to-Clarify框架通过多轮对话解决模糊指令并端到端生成动作,在现实任务中超越了现有最先进的视觉语言模型方法。
English: The Ask-to-Clarify framework enables embodied agents to resolve ambiguous instructions through multi-turn dialogue and generate actions end-to-end, outperforming existing methods in real-world tasks.

Authors:Mingqing Zhang, Zhuoning Xu, Peijie Wang, Rongji Li, Liang Wang, Qiang Liu, Jian Xu, Xuyao Zhang, Shu Wu, Liang Wang
Title: AgriDoctor: A Multimodal Intelligent Assistant for Agriculture
Abstract:
Accurate crop disease diagnosis is essential for sustainable agriculture and global food security. Existing methods, which primarily rely on unimodal models such as image-based classifiers and object detectors, are limited in their ability to incorporate domain-specific agricultural knowledge and lack support for interactive, language-based understanding. Recent advances in large language models (LLMs) and large vision-language models (LVLMs) have opened new avenues for multimodal reasoning. However, their performance in agricultural contexts remains limited due to the absence of specialized datasets and insufficient domain adaptation. In this work, we propose AgriDoctor, a modular and extensible multimodal framework designed for intelligent crop disease diagnosis and agricultural knowledge interaction. As a pioneering effort to introduce agent-based multimodal reasoning into the agricultural domain, AgriDoctor offers a novel paradigm for building interactive and domain-adaptive crop health solutions. It integrates five core components: a router, classifier, detector, knowledge retriever and LLMs. To facilitate effective training and evaluation, we construct AgriMM, a comprehensive benchmark comprising 400000 annotated disease images, 831 expert-curated knowledge entries, and 300000 bilingual prompts for intent-driven tool selection. Extensive experiments demonstrate that AgriDoctor, trained on AgriMM, significantly outperforms state-of-the-art LVLMs on fine-grained agricultural tasks, establishing a new paradigm for intelligent and sustainable farming applications.
中文: 该摘要提出AgriDoctor框架,通过整合视觉分析与农业知识构建模块化多模态作物病害诊断系统,并基于AgriMM基准测试证明其在精细农业任务中显著优于现有先进模型。
English: This abstract introduces AgriDoctor, a modular multimodal framework that enhances crop disease diagnosis by integrating visual analysis with agricultural knowledge, supported by the newly developed AgriMM benchmark, which significantly outperforms existing models in agricultural tasks.

Authors:Zhilun Zhou, Jing Yi Wang, Nicholas Sukiennik, Chen Gao, Fengli Xu, Yong Li, James Evans
Title: Rationality Check! Benchmarking the Rationality of Large Language Models
Abstract:
Large language models (LLMs), a recent advance in deep learning and machine intelligence, have manifested astonishing capacities, now considered among the most promising for artificial general intelligence. With human-like capabilities, LLMs have been used to simulate humans and serve as AI assistants across many applications. As a result, great concern has arisen about whether and under what circumstances LLMs think and behave like real human agents. Rationality is among the most important concepts in assessing human behavior, both in thinking (i.e., theoretical rationality) and in taking action (i.e., practical rationality). In this work, we propose the first benchmark for evaluating the omnibus rationality of LLMs, covering a wide range of domains and LLMs. The benchmark includes an easy-to-use toolkit, extensive experimental results, and analysis that illuminates where LLMs converge and diverge from idealized human rationality. We believe the benchmark can serve as a foundational tool for both developers and users of LLMs.
中文: 大型语言模型展现出类人能力,但其理性备受关注,为此开发了首个综合性基准,用于评估其在多领域与理想人类理性的契合度。
English: Large language models demonstrate human-like capabilities but raise concerns about their rationality, leading to the creation of a comprehensive benchmark to evaluate their alignment with idealized human reasoning across various domains.

Authors:Yue Ding, Xiaofang Zhu, Tianze Xia, Junfei Wu, Xinlong Chen, Qiang Liu, Liang Wang
Title: D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs
Abstract:
Although large Language Models (LLMs) have achieved remarkable success, their practical application is often hindered by the generation of non-factual content, which is called "hallucination". Ensuring the reliability of LLMs' outputs is a critical challenge, particularly in high-stakes domains such as finance, security, and healthcare. In this work, we revisit hallucination detection from the perspective of model architecture and generation dynamics. Leveraging the multi-layer structure and autoregressive decoding process of LLMs, we decompose hallucination signals into two complementary dimensions: the semantic breadth of token representations within each layer, and the semantic depth of core concepts as they evolve across layers. Based on this insight, we propose \textbf{D$^2$HScore (Dispersion and Drift-based Hallucination Score)}, a training-free and label-free framework that jointly measures: (1) \textbf{Intra-Layer Dispersion}, which quantifies the semantic diversity of token representations within each layer; and (2) \textbf{Inter-Layer Drift}, which tracks the progressive transformation of key token representations across layers. To ensure drift reflects the evolution of meaningful semantics rather than noisy or redundant tokens, we guide token selection using attention signals. By capturing both the horizontal and vertical dynamics of representation during inference, D$^2$HScore provides an interpretable and lightweight proxy for hallucination detection. Extensive experiments across five open-source LLMs and five widely used benchmarks demonstrate that D$^2$HScore consistently outperforms existing training-free baselines.
中文摘要:本文提出D²HScore框架,通过分析表征层内离散度和层间漂移来检测大语言模型的幻觉现象,无需训练即可在多个基准测试中优于现有方法。
English Summary: This paper introduces D²HScore, a training-free framework that detects hallucinations in large language models by analyzing intra-layer dispersion and inter-layer drift of token representations, demonstrating superior performance across multiple benchmarks.

Authors:Alexander Spiridonov, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel
Title: Generalist Robot Manipulation beyond Action Labeled Data
Abstract:
Recent advances in generalist robot manipulation leverage pre-trained Vision-Language Models (VLMs) and large-scale robot demonstrations to tackle diverse tasks in a zero-shot manner. A key challenge remains: scaling high-quality, action-labeled robot demonstration data, which existing methods rely on for robustness and generalization. To address this, we propose a method that benefits from videos without action labels - featuring humans and/or robots in action - enhancing open-vocabulary performance and enabling data-efficient learning of new tasks. Our method extracts dense, dynamic 3D point clouds at the hand or gripper location and uses a proposed 3D dynamics predictor for self-supervision. This predictor is then tuned to an action predictor using a smaller labeled dataset for action alignment. We show that our method not only learns from unlabeled human and robot demonstrations - improving downstream generalist robot policies - but also enables robots to learn new tasks without action labels (i.e., out-of-action generalization) in both real-world and simulated settings.
中文摘要:通用机器人操作的最新进展利用预训练的视觉语言模型和大规模演示数据实现零样本任务处理,但高质量动作标注数据的扩展仍是关键挑战;我们提出的方法通过从无标签视频中提取密集动态3D点云并进行自监督动态预测,不仅提升了机器人策略性能,还实现了在真实与仿真环境中无需动作标注的新任务学习能力。
English Summary: Recent progress in generalist robot manipulation uses Vision-Language Models and large-scale demonstrations for zero-shot task handling, but faces challenges in scaling high-quality labeled data; our method overcomes this by learning from unlabeled videos through 3D point cloud extraction and self-supervised dynamics prediction, improving robot policies and enabling label-free task learning in real and simulated environments.

Authors:Xu Zheng, Chenfei Liao, Ziqiao Weng, Kaiyu Lei, Zihao Dongfang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Lu Qi, Li Chen, Danda Pani Paudel, Kailun Yang, Linfeng Zhang, Luc Van Gool, Xuming Hu
Title: PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
Abstract:
Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.
中文: 在具身AI时代,全向视觉技术迅速发展,通过生成、感知和理解等领域的突破提供全面的环境感知和决策支持,但构建鲁棒的全向AI系统仍面临未来挑战。
English: Omnidirectional vision is advancing rapidly in the embodied AI era, offering comprehensive environmental perception and enhanced decision-making through recent breakthroughs in generation, perception, and understanding, while facing future challenges in developing robust AI systems.

Authors:Qinglin Wang, Zhihong Sun, Ruyun Wang, Tao Huang, Zhi Jin, Ge Li, Chen Lyu
Title: SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code
Abstract:
Large Language Models (LLMs) can translate natural language requirements into code, yet empirical analyses of representative models reveal that semantic errors-programs that compile but behave incorrectly-constitute the majority of observed faults (e.g., >60% on DeepSeek-Coder-6.7B and QwenCoder-7B). Post-hoc repair pipelines detect such faults only after execution, incurring latency, relying on incomplete test suites, and often mis-localizing the defect. Since semantic drift originates in the autoregressive decoding process, intervening while the code is being generated is a direct way to stop error propagation. Constrained-decoding approaches such as ROCODE attempt this, but still wait until the entire program runs to obtain feedback and use entropy heuristics that do not truly capture semantics. A more effective solution must inject semantic signals-early and precisely-into the decoding process.We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision. To train the evaluator, we build SemDiff, the first dataset with fine-grained annotations that mark the exact line where a correct and an incorrect implementation diverge. The evaluator, once embedded in the LLM's decoder, flags deviations on partial code, rolls back to the faulty line, and guides regeneration-without executing the program or requiring test cases. Across four benchmarks, SemGuard consistently outperforms state-of-the-art baselines. It lowers the semantic error rate by 19.86% on SemDiff relative to ROCODE, and lifts Pass@1 by 48.92% on the real-world LiveCodeBench with CodeLlama-7B. Similar gains hold for StarCoder2-7B on MBPP and for DeepSeekCoder-6.7B on the Java benchmark SemDiff-Java, demonstrating model- and language-agnostic effectiveness.
中文: 大语言模型生成的代码常存在语义错误,虽能编译但运行不正确;提出的SemGuard框架通过在解码过程中实施实时、行级语义监督,无需程序执行或测试用例即可有效降低错误率。
English: Large Language Models often generate code with semantic errors that compile but behave incorrectly, and the proposed SemGuard framework addresses this by providing real-time, line-level semantic supervision during decoding to reduce errors without needing program execution or test cases.

Authors:Xiaobo Xing, Wei Yuan, Tong Chen, Quoc Viet Hung Nguyen, Xiangliang Zhang, Hongzhi Yin
Title: TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding
Abstract:
Modeling semantic and structural information from tabular data remains a core challenge for effective table understanding. Existing Table-as-Text approaches flatten tables for large language models (LLMs), but lose crucial structural cues, while Table-as-Image methods preserve structure yet struggle with fine-grained semantics. Recent Table-as-Multimodality strategies attempt to combine textual and visual views, but they (1) statically process both modalities for every query-table pair within a large multimodal LLMs (MLLMs), inevitably introducing redundancy and even conflicts, and (2) depend on costly fine-tuning of MLLMs. In light of this, we propose TableDART, a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. TableDART introduces a lightweight 2.59M-parameter MLP gating network that dynamically selects the optimal path (either Text-only, Image-only, or Fusion) for each table-query pair, effectively reducing redundancy and conflicts from both modalities. In addition, we propose a novel agent to mediate cross-modal knowledge integration by analyzing outputs from text- and image-based models, either selecting the best result or synthesizing a new answer through reasoning. This design avoids the prohibitive costs of full MLLM fine-tuning. Extensive experiments on seven benchmarks show that TableDART establishes new state-of-the-art performance among open-source models, surpassing the strongest baseline by an average of 4.02%. The code is available at: https://anonymous.4open.science/r/TableDART-C52B
中文摘要:TableDART提出了一种训练高效的框架,通过轻量级门控网络动态选择表格-查询对的最佳多模态路径,并协调跨模态知识整合,在不进行昂贵多模态大模型微调的情况下实现了最先进的性能。
English Summary: TableDART is a training-efficient framework that dynamically selects optimal multimodal paths for table-query pairs using a lightweight gating network and mediates cross-modal knowledge integration, achieving state-of-the-art performance without costly MLLM fine-tuning.

Authors:Ziliang Wang, Ge Li, Jia Li, Hao Zhu, Zhi Jin
Title: VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection
Abstract:
The application of language models to project-level vulnerability detection remains challenging, owing to the dual requirement of accurately localizing security-sensitive code and correctly correlating and reasoning over complex program context. We present VulAgent, a multi-agent vulnerability detection framework based on hypothesis validation. Our design is inspired by how human auditors review code: when noticing a sensitive operation, they form a hypothesis about a possible vulnerability, consider potential trigger paths, and then verify the hypothesis against the surrounding context. VulAgent implements a semantics-sensitive, multi-view detection pipeline: specialized agents, each aligned to a specific analysis perspective (e.g., memory, authorization), collaboratively surface and precisely localize sensitive code sites with higher coverage. Building on this, VulAgent adopts a hypothesis-validation paradigm: for each vulnerability report, it builds hypothesis conditions and a trigger path, steering the LLM to target the relevant program context and defensive checks during verification, which reduces false positives. On average across the two datasets, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450% (246% on average), and reduces the false positive rate by about 36% compared with state-of-the-art LLM-based baselines.
VulAgent是一种基于假设验证的多智能体漏洞检测框架,通过模拟人工代码审查过程,显著提高了检测精度并大幅降低了误报率。
VulAgent is a multi-agent framework that enhances vulnerability detection by simulating human auditing through hypothesis validation, significantly improving accuracy and reducing false positives compared to existing methods.

Authors:Jing Long, Sirui Huang, Huan Huo, Tong Chen, Hongzhi Yin, Guandong Xu
Title: Cloud-Device Collaborative Agents for Sequential Recommendation
Abstract:
Recent advances in large language models (LLMs) have enabled agent-based recommendation systems with strong semantic understanding and flexible reasoning capabilities. While LLM-based agents deployed in the cloud offer powerful personalization, they often suffer from privacy concerns, limited access to real-time signals, and scalability bottlenecks. Conversely, on-device agents ensure privacy and responsiveness but lack the computational power for global modeling and large-scale retrieval. To bridge these complementary limitations, we propose CDA4Rec, a novel Cloud-Device collaborative framework for sequential Recommendation, powered by dual agents: a cloud-side LLM and a device-side small language model (SLM). CDA4Rec tackles the core challenge of cloud-device coordination by decomposing the recommendation task into modular sub-tasks including semantic modeling, candidate retrieval, structured user modeling, and final ranking, which are allocated to cloud or device based on computational demands and privacy sensitivity. A strategy planning mechanism leverages the cloud agent's reasoning ability to generate personalized execution plans, enabling context-aware task assignment and partial parallel execution across agents. This design ensures real-time responsiveness, improved efficiency, and fine-grained personalization, even under diverse user states and behavioral sparsity. Extensive experiments across multiple real-world datasets demonstrate that CDA4Rec consistently outperforms competitive baselines in both accuracy and efficiency, validating its effectiveness in heterogeneous and resource-constrained environments.
中文: CDA4Rec提出了一种云-端协作框架,采用云端大语言模型与端侧小语言模型的双智能体设计,通过模块化任务分解和个性化执行策略,在保障隐私与实时响应的同时提升序列推荐的计算效率和准确性。
English: CDA4Rec introduces a cloud-device collaborative framework using dual agents—a cloud-based LLM and an on-device SLM—to enhance sequential recommendations by balancing computational power, privacy, and real-time responsiveness through modular task decomposition and personalized execution planning.

Authors:Wei Jiang, Tong Chen, Wei Yuan, Xiangyu Zhao, Quoc Viet Hung Nguyen, Hongzhi Yin
Title: Towards Propagation-aware Representation Learning for Supervised Social Media Graph Analytics
Abstract:
Social media platforms generate vast, complex graph-structured data, facilitating diverse tasks such as rumor detection, bot identification, and influence modeling. Real-world applications like public opinion monitoring and stock trading -- which have a strong attachment to social media -- demand models that are performant across diverse tasks and datasets. However, most existing solutions are purely data-driven, exhibiting vulnerability to the inherent noise within social media data. Moreover, the reliance on task-specific model design challenges efficient reuse of the same model architecture on different tasks, incurring repetitive engineering efforts. To address these challenges in social media graph analytics, we propose a general representation learning framework that integrates a dual-encoder structure with a kinetic-guided propagation module. In addition to jointly modeling structural and contextual information with two encoders, our framework innovatively captures the information propagation dynamics within social media graphs by integrating principled kinetic knowledge. By deriving a propagation-aware encoder and corresponding optimization objective from a Markov chain-based transmission model, the representation learning pipeline receives a boost in its robustness to noisy data and versatility in diverse tasks. Extensive experiments verify that our approach achieves state-of-the-art performance with a unified architecture on a variety of social media graph mining tasks spanning graph classification, node classification, and link prediction. Besides, our solution exhibits strong zero-shot and few-shot transferability across datasets, demonstrating practicality when handling data-scarce tasks.
中文摘要:该框架通过双编码器与动力学引导模块的结合,提升了社交媒体图分析对噪声的鲁棒性和多任务适应性,实现了最优性能并展现出强大的跨数据集迁移能力。
English Summary: The proposed framework integrates dual encoders with a kinetic-guided module to enhance robustness against noise and versatility across social media graph tasks, achieving state-of-the-art performance and strong transferability.

Authors:Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, Vincent Y. F. Tan
Title: Muon Outperforms Adam in Tail-End Associative Memory Learning
Abstract:
The Muon optimizer is consistently faster than Adam in training Large Language Models (LLMs), yet the mechanism underlying its success remains unclear. This paper demystifies this mechanism through the lens of associative memory. By ablating the transformer components optimized by Muon, we reveal that the associative memory parameters of LLMs, namely the Value and Output (VO) attention weights and Feed-Forward Networks (FFNs), are the primary contributors to Muon's superiority. Motivated by this associative memory view, we then explain Muon's superiority on real-world corpora, which are intrinsically heavy-tailed: a few classes (tail classes) appear far less frequently than others. The superiority is explained through two key properties: (i) its update rule consistently yields a more isotropic singular spectrum than Adam; and as a result, (ii) on heavy-tailed data, it optimizes tail classes more effectively than Adam. Beyond empirical evidence, we theoretically confirm these findings by analyzing a one-layer associative memory model under class-imbalanced data. We prove that Muon consistently achieves balanced learning across classes regardless of feature embeddings, whereas Adam can induce large disparities in learning errors depending on embedding properties. In summary, our empirical observations and theoretical analyses reveal Muon's core advantage: its update rule aligns with the outer-product structure of linear associative memories, enabling more balanced and effective learning of tail classes in heavy-tailed distributions than Adam.
Muon优化器通过其与联想记忆结构对齐的各向同性更新规则,在重尾数据中实现更均衡的尾部类别学习,从而在训练大语言模型时优于Adam。
Muon optimizer outperforms Adam in training LLMs by enabling more balanced learning of tail classes in heavy-tailed data through its isotropic update rule aligned with associative memory structures.

Authors:Ziniu Li, Congliang Chen, Tianyun Yang, Tian Ding, Ruoyu Sun, Ge Zhang, Wenhao Huang, Zhi-Quan Luo
Title: Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation
Abstract:
Large Language Models (LLMs) can self-improve through reinforcement learning, where they generate trajectories to explore and discover better solutions. However, this exploration process is computationally expensive, often forcing current methods to assign limited exploration budgets to each task. This uniform allocation creates problematic edge cases: easy tasks consistently succeed while difficult tasks consistently fail, both producing zero gradients during training updates for the widely used Group Relative Policy Optimization (GRPO). We address this problem from the lens of exploration budget allocation. Viewing each task's exploration as an "item" with a distinct "value" and "cost", we establish a connection to the classical knapsack problem. This formulation allows us to derive an optimal assignment rule that adaptively distributes resources based on the model's current learning status. When applied to GRPO, our method increases the effective ratio of non-zero policy gradients by 20-40% during training. Acting as a computational "free lunch", our approach could reallocate exploration budgets from tasks where learning is saturated to those where it is most impactful. This enables significantly larger budgets (e.g., 93 rollouts) for especially challenging problems, which would be computationally prohibitive under a uniform allocation. These improvements translate to meaningful gains on mathematical reasoning benchmarks, with average improvements of 2-4 points and peak gains of 9 points on specific tasks. Notably, achieving comparable performance with traditional homogeneous allocation would require about 2x the computational resources.
中文: 我们的方法将任务视为背包问题中的物品,优化大语言模型的探索预算分配,动态地将资源从学习饱和的任务转移到关键任务,使非零策略梯度比例提高20-40%,在推理基准上实现显著性能提升,同时节省一半计算资源。
English: Our method optimizes exploration budget allocation in LLMs by treating tasks as items in a knapsack problem, dynamically shifting resources from saturated to impactful tasks, which boosts non-zero policy gradients by 20-40% and yields significant performance gains on reasoning benchmarks while halving computational costs.

Authors:Xuan He, Dongfu Jiang, Ping Nie, Minghao Liu, Zhengxuan Jiang, Mingyi Su, Wentao Ma, Junru Lin, Chun Ye, Yi Lu, Keming Wu, Benjamin Schneider, Quy Duc Do, Zhuofeng Li, Yiming Jia, Yuxuan Zhang, Guo Cheng, Haozhe Wang, Wangchunshu Zhou, Qunshu Lin, Yuanxing Zhang, Ge Zhang, Wenhao Huang, Wenhu Chen
Title: VideoScore2: Think before You Score in Generative Video Evaluation
Abstract:
Recent advances in text-to-video generation have produced increasingly realistic and diverse content, yet evaluating such videos remains a fundamental challenge due to their multi-faceted nature encompassing visual quality, semantic alignment, and physical consistency. Existing evaluators and reward models are limited to single opaque scores, lack interpretability, or provide only coarse analysis, making them insufficient for capturing the comprehensive nature of video quality assessment. We present VideoScore2, a multi-dimensional, interpretable, and human-aligned framework that explicitly evaluates visual quality, text-to-video alignment, and physical/common-sense consistency while producing detailed chain-of-thought rationales. Our model is trained on a large-scale dataset VideoFeedback2 containing 27,168 human-annotated videos with both scores and reasoning traces across three dimensions, using a two-stage pipeline of supervised fine-tuning followed by reinforcement learning with Group Relative Policy Optimization (GRPO) to enhance analytical robustness. Extensive experiments demonstrate that VideoScore2 achieves superior performance with 44.35 (+5.94) accuracy on our in-domain benchmark VideoScore-Bench-v2 and 50.37 (+4.32) average performance across four out-of-domain benchmarks (VideoGenReward-Bench, VideoPhy2, etc), while providing interpretable assessments that bridge the gap between evaluation and controllable generation through effective reward modeling for Best-of-N sampling. Project Page: https://tiger-ai-lab.github.io/VideoScore2/
Chinese Summary: VideoScore2是一个多维可解释的视频评估框架,通过三阶段维度评估和强化学习训练,在多个基准测试中表现优异,并为可控生成提供可解释的评估依据。
English Summary: VideoScore2 is a multi-dimensional and interpretable framework that evaluates text-to-video generation across visual quality, semantic alignment, and physical consistency, achieving superior performance on benchmarks while providing detailed rationales.

Authors:Jiawei Zhao, Yuang Qi, Weiming Zhang, Nenghai Yu, Kejiang Chen
Title: PSRT: Accelerating LRM-based Guard Models via Prefilled Safe Reasoning Traces
Abstract:
Large Reasoning Models (LRMs) have demonstrated remarkable performance on tasks such as mathematics and code generation. Motivated by these strengths, recent work has empirically demonstrated the effectiveness of LRMs as guard models in improving harmful query detection. However, LRMs typically generate long reasoning traces during inference, causing substantial computational overhead. In this paper, we introduce PSRT, a method that replaces the model's reasoning process with a Prefilled Safe Reasoning Trace, thereby significantly reducing the inference cost of LRMs. Concretely, PSRT prefills "safe reasoning virtual tokens" from a constructed dataset and learns over their continuous embeddings. With the aid of indicator tokens, PSRT enables harmful-query detection in a single forward pass while preserving the classification effectiveness of LRMs. We evaluate PSRT on 7 models, 13 datasets, and 8 jailbreak methods. In terms of efficiency, PSRT completely removes the overhead of generating reasoning tokens during inference. In terms of classification performance, PSRT achieves nearly identical accuracy, with only a minor average F1 drop of 0.015 across 7 models and 5 datasets.
中文摘要:PSRT是一种通过预填充安全推理轨迹的新方法,在保持大推理模型有害查询检测性能基本不变的同时,显著降低了其计算开销。
English Summary: PSRT is a novel method that prefills safe reasoning traces to significantly reduce the computational overhead of Large Reasoning Models while maintaining nearly identical harmful-query detection performance.

Authors:Xingkai Peng, Jun Jiang, Meng Tong, Shuai Li, Weiming Zhang, Nenghai Yu, Kejiang Chen
Title: Multimodal Prompt Decoupling Attack on the Safety Filters in Text-to-Image Models
Abstract:
Text-to-image (T2I) models have been widely applied in generating high-fidelity images across various domains. However, these models may also be abused to produce Not-Safe-for-Work (NSFW) content via jailbreak attacks. Existing jailbreak methods primarily manipulate the textual prompt, leaving potential vulnerabilities in image-based inputs largely unexplored. Moreover, text-based methods face challenges in bypassing the model's safety filters. In response to these limitations, we propose the Multimodal Prompt Decoupling Attack (MPDA), which utilizes image modality to separate the harmful semantic components of the original unsafe prompt. MPDA follows three core steps: firstly, a large language model (LLM) decouples unsafe prompts into pseudo-safe prompts and harmful prompts. The former are seemingly harmless sub-prompts that can bypass filters, while the latter are sub-prompts with unsafe semantics that trigger filters. Subsequently, the LLM rewrites the harmful prompts into natural adversarial prompts to bypass safety filters, which guide the T2I model to modify the base image into an NSFW output. Finally, to ensure semantic consistency between the generated NSFW images and the original unsafe prompts, the visual language model generates image captions, providing a new pathway to guide the LLM in iterative rewriting and refining the generated content.
中文: 多模态提示解耦攻击(MPDA)利用图像输入规避文生图模型的安全过滤器,通过将有害提示分解为伪安全与对抗性组件,并借助迭代优化确保语义一致性。
English: The Multimodal Prompt Decoupling Attack (MPDA) leverages image inputs to bypass safety filters in text-to-image models by decoupling harmful prompts into pseudo-safe and adversarial components, ensuring semantic consistency through iterative refinement.

Authors:Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, Leszek Rutkowski, Mang Ye, Bo Du, Dacheng Tao
Title: MAPO: Mixed Advantage Policy Optimization
Abstract:
Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
Chinese: 本文提出混合优势策略优化(MAPO),通过基于轨迹确定性的动态优势函数调整,解决了优势反转和镜像问题,经与先进方法对比及消融实验验证了其有效性。
English: This paper introduces Mixed Advantage Policy Optimization (MAPO), an enhanced Group Relative Policy Optimization strategy that addresses advantage reversion and mirroring issues by dynamically adjusting the advantage function based on trajectory certainty, validated as effective through comparisons and ablation studies.

Authors:Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang
Title: FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
Abstract:
Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate data searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and the country origin of models and tools impact performance significantly.By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
中文:FinSearchComp是首个针对现实金融搜索与推理任务的开源基准测试,通过专家标注的时效性数据检索和复杂历史调查问题,全面评估智能体在金融市场中的专业表现。
English: FinSearchComp is the first open-source benchmark designed to evaluate LLM-based agents on realistic financial search and reasoning tasks, featuring expert-annotated questions that test time-sensitive data retrieval and complex investigations across global markets.

Authors:Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Wei Ye, Tong Yang, Wenhao Huang, Ge Zhang, Fangzhen Lin
Title: Reverse-Engineered Reasoning for Open-Ended Generation
Abstract:
While the ``deep reasoning'' paradigm has spurred significant advances in verifiable domains like mathematics, its application to open-ended, creative generation remains a critical challenge. The two dominant methods for instilling reasoning -- reinforcement learning (RL) and instruction distillation -- falter in this area; RL struggles with the absence of clear reward signals and high-quality reward models, while distillation is prohibitively expensive and capped by the teacher model's capabilities. To overcome these limitations, we introduce REverse-Engineered Reasoning (REER), a new paradigm that fundamentally shifts the approach. Instead of building a reasoning process ``forwards'' through trial-and-error or imitation, REER works ``backwards'' from known-good solutions to computationally discover the latent, step-by-step deep reasoning process that could have produced them. Using this scalable, gradient-free approach, we curate and open-source DeepWriting-20K, a large-scale dataset of 20,000 deep reasoning trajectories for open-ended tasks. Our model, DeepWriter-8B, trained on this data, not only surpasses strong open-source baselines but also achieves performance competitive with, and at times superior to, leading proprietary models like GPT-4o and Claude 3.5.
Chinese: REER范式通过从已知解决方案反向推导逐步推理逻辑,构建了DeepWriting-20K数据集和DeepWriter-8B模型,其性能不仅超越开源基线,更可与GPT-4o和Claude 3.5等顶尖专有模型相媲美甚至更优。
English: The REER paradigm reverses the reasoning process by deriving step-by-step logic from known solutions, enabling the creation of DeepWriting-20K dataset and DeepWriter-8B model that outperform open-source models and rival top proprietary ones like GPT-4o and Claude 3.5.

Authors:Xin Lin, Xian Ge, Dizhe Zhang, Zhaoliang Wan, Xianshun Wang, Xiangtai Li, Wenjie Jiang, Bo Du, Dacheng Tao, Ming-Hsuan Yang, Lu Qi
Title: One Flight Over the Gap: A Survey from Perspective to Panoramic Vision
Abstract:
Driven by the demand for spatial intelligence and holistic scene perception, omnidirectional images (ODIs), which provide a complete 360\textdegree{} field of view, are receiving growing attention across diverse applications such as virtual reality, autonomous driving, and embodied robotics. Despite their unique characteristics, ODIs exhibit remarkable differences from perspective images in geometric projection, spatial distribution, and boundary continuity, making it challenging for direct domain adaption from perspective methods. This survey reviews recent panoramic vision techniques with a particular emphasis on the perspective-to-panorama adaptation. We first revisit the panoramic imaging pipeline and projection methods to build the prior knowledge required for analyzing the structural disparities. Then, we summarize three challenges of domain adaptation: severe geometric distortions near the poles, non-uniform sampling in Equirectangular Projection (ERP), and periodic boundary continuity. Building on this, we cover 20+ representative tasks drawn from more than 300 research papers in two dimensions. On one hand, we present a cross-method analysis of representative strategies for addressing panoramic specific challenges across different tasks. On the other hand, we conduct a cross-task comparison and classify panoramic vision into four major categories: visual quality enhancement and assessment, visual understanding, multimodal understanding, and visual generation. In addition, we discuss open challenges and future directions in data, models, and applications that will drive the advancement of panoramic vision research. We hope that our work can provide new insight and forward looking perspectives to advance the development of panoramic vision technologies. Our project page is https://insta360-research-team.github.io/Survey-of-Panorama
中文摘要:本综述探讨全景视觉技术,重点分析如何将平面图像方法适配到全方位图像,通过解决几何畸变和边界连续性等特有挑战,对任务进行分类并展望未来研究方向。
English Summary: This survey examines panoramic vision techniques, focusing on adapting perspective methods to omnidirectional images by addressing unique challenges like geometric distortions and boundary continuity, while categorizing tasks and discussing future research directions.

Authors:Qinyan Zhang, Xinping Lei, Ruijie Miao, Yu Fu, Haojie Fan, Le Chang, Jiafan Hou, Dingling Zhang, Zhongfei Hou, Ziqiang Yang, Changxin Pu, Fei Hu, Jingkai Liu, Mengyun Liu, Yang Liu, Xiang Gao, Jiaheng Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang
Title: Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?
Abstract:
Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models Counter-intuitive Abilitytheir capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
中文: 大语言模型常表现出认知惯性,难以遵循反直觉指令,因此提出逆向IFEval基准来评估其克服训练偏见和适应非常规情境的能力,强调未来对齐工作需在流畅性和准确性之外增强模型的适应性。
English: Large Language Models often struggle with cognitive inertia, failing to follow counter-intuitive instructions, so the Inverse IFEval benchmark is introduced to assess their ability to override training biases and adapt to unconventional contexts, highlighting the need for future alignment efforts to enhance adaptability beyond mere fluency and correctness.

Authors:Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, Wanjun Zhong, Yining Ye, Yujia Qin, Yuwen Xiong, Yuxin Song, Zhiyong Wu, Aoyan Li, Bo Li, Chen Dun, Chong Liu, Daoguang Zan, Fuxing Leng, Hanbin Wang, Hao Yu, Haobin Chen, Hongyi Guo, Jing Su, Jingjia Huang, Kai Shen, Kaiyu Shi, Lin Yan, Peiyao Zhao, Pengfei Liu, Qinghao Ye, Renjie Zheng, Shulin Xin, Wayne Xin Zhao, Wen Heng, Wenhao Huang, Wenqian Wang, Xiaobo Qin, Yi Lin, Youbin Wu, Zehui Chen, Zihao Wang, Baoquan Zhong, Xinchun Zhang, Xujing Li, Yuanfan Li, Zhongkai Zhao, Chengquan Jiang, Faming Wu, Haotian Zhou, Jinlin Pang, Li Han, Qi Liu, Qianli Ma, Siyao Liu, Songhua Cai, Wenqi Fu, Xin Liu, Yaohui Wang, Zhi Zhang, Bo Zhou, Guoliang Li, Jiajun Shi, Jiale Yang, Jie Tang, Li Li, Qihua Han, Taoran Lu, Woyu Lin, Xiaokang Tong, Xinyao Li, Yichi Zhang, Yu Miao, Zhengxuan Jiang, Zili Li, Ziyuan Zhao, Chenxin Li, Dehua Ma, Feng Lin, Ge Zhang, Haihua Yang, Hangyu Guo, Hongda Zhu, Jiaheng Liu, Junda Du, Kai Cai, Kuanye Li, Lichen Yuan, Meilan Han, Minchao Wang, Shuyue Guo, Tianhao Cheng, Xiaobo Ma, Xiaojun Xiao, Xiaolong Huang, Xinjie Chen, Yidi Du, Yilin Chen, Yiwen Wang, Zhaojian Li, Zhenzhu Yang, Zhiyuan Zeng, Chaolin Jin, Chen Li, Hao Chen, Haoli Chen, Jian Chen, Qinghao Zhao, Guang Shi
Title: UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Abstract:
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite-roughly 60% of human-level performance-and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
中文: UI-TARS-2 是一种原生图形用户界面代理模型,通过系统性训练方法解决了数据可扩展性、多轮强化学习和环境稳定性等关键挑战,在图形界面基准测试中表现优异,并在游戏与长周期任务中展现出强大泛化能力。
English: UI-TARS-2 is a native GUI agent model that addresses key challenges in data scalability, multi-turn reinforcement learning, and environment stability, achieving superior performance on GUI benchmarks and competitive results in gaming and long-horizon tasks.

Authors:Ziyi Yang, Weizhou Shen, Ruijun Chen, Chenliang Li, Fanqi Wan, Ming Yan, Xiaojun Quan, Fei Huang
Title: SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models
Abstract:
Progress in long-context reasoning for large language models (LLMs) has lagged behind other recent advances. This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals. In this paper, we propose SPELL, a multi-role self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning. SPELL integrates three cyclical roles-questioner, responder, and verifier-within a single model to enable continual self-improvement. The questioner generates questions from raw documents paired with reference answers; the responder learns to solve these questions based on the documents; and the verifier evaluates semantic equivalence between the responder's output and the questioner's reference answer, producing reward signals to guide continual training. To stabilize training, we introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities. Extensive experiments on six long-context benchmarks show that SPELL consistently improves performance across diverse LLMs and outperforms equally sized models fine-tuned on large-scale annotated data. Notably, SPELL achieves an average 7.6-point gain in pass@8 on the strong reasoning model Qwen3-30B-A3B-Thinking, raising its performance ceiling and showing promise for scaling to even more capable models.
中文: 本文提出SPELL框架,通过将提问者、回答者和验证者角色集成于单一模型,实现无需标注的长上下文推理优化,在多项基准测试中显著提升模型性能。
English: The paper introduces SPELL, a self-play reinforcement learning framework that enables scalable, label-free optimization for long-context reasoning by integrating questioner, responder, and verifier roles in a single model, achieving significant performance gains across multiple benchmarks.

Authors:Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
Title: WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Abstract:
This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and one-shot generation paradigms that easily suffer from long-context failure issues like "loss in the middle" and hallucinations. To address these challenges, we introduce WebWeaver, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, source-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank for each part, it effectively mitigates long-context issues. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing high-quality, reliable, and well-structured reports.
中文摘要:本文提出WebWeaver双智能体框架,通过动态整合规划与证据收集的分层写作方法,有效解决开放式深度研究中的幻觉问题和引文不准难题,显著提升报告质量。
English Summary: This paper introduces WebWeaver, a dual-agent framework that addresses open-ended deep research challenges by dynamically integrating planning with evidence acquisition and hierarchical writing to enhance report quality and citation accuracy.

Authors:Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
Title: WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
Abstract:
This paper tackles \textbf{open-ended deep research (OEDR)}, a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. Current approaches are plagued by dual-fold limitations: static research pipelines that decouple planning from evidence acquisition and monolithic generation paradigms that include redundant, irrelevant evidence, suffering from hallucination issues and low citation accuracy. To address these challenges, we introduce \textbf{WebWeaver}, a novel dual-agent framework that emulates the human research process. The planner operates in a dynamic cycle, iteratively interleaving evidence acquisition with outline optimization to produce a comprehensive, citation-grounded outline linking to a memory bank of evidence. The writer then executes a hierarchical retrieval and writing process, composing the report section by section. By performing targeted retrieval of only the necessary evidence from the memory bank via citations for each part, it effectively mitigates long-context issues and citation hallucinations. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym. These results validate our human-centric, iterative methodology, demonstrating that adaptive planning and focused synthesis are crucial for producing comprehensive, trusted, and well-structured reports.
中文摘要:本文提出WebWeaver双智能体框架,通过动态整合规划与证据收集的分层写作方法,有效解决开放式深度研究中的幻觉问题和引文不准难题,显著提升报告质量。
English Summary: This paper introduces WebWeaver, a dual-agent framework that addresses open-ended deep research challenges by dynamically integrating planning with evidence acquisition and hierarchical writing to enhance report quality and citation accuracy.

Authors:An Guo, Shuoxiao Zhang, Enyi Tang, Xinyu Gao, Haomin Pang, Haoxiang Tian, Yanzhou Mu, Wu Wen, Chunrong Fang, Zhenyu Chen
Title: When Autonomous Vehicle Meets V2X Cooperative Perception: How Far Are We?
Abstract:
With the tremendous advancement of deep learning and communication technology, Vehicle-to-Everything (V2X) cooperative perception has the potential to address limitations in sensing distant objects and occlusion for a single-agent perception system. V2X cooperative perception systems are software systems characterized by diverse sensor types and cooperative agents, varying fusion schemes, and operation under different communication conditions. Therefore, their complex composition gives rise to numerous operational challenges. Furthermore, when cooperative perception systems produce erroneous predictions, the types of errors and their underlying causes remain insufficiently explored. To bridge this gap, we take an initial step by conducting an empirical study of V2X cooperative perception. To systematically evaluate the impact of cooperative perception on the ego vehicle's perception performance, we identify and analyze six prevalent error patterns in cooperative perception systems. We further conduct a systematic evaluation of the critical components of these systems through our large-scale study and identify the following key findings: (1) The LiDAR-based cooperation configuration exhibits the highest perception performance; (2) Vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication exhibit distinct cooperative perception performance under different fusion schemes; (3) Increased cooperative perception errors may result in a higher frequency of driving violations; (4) Cooperative perception systems are not robust against communication interference when running online. Our results reveal potential risks and vulnerabilities in critical components of cooperative perception systems. We hope that our findings can better promote the design and repair of cooperative perception systems.
中文: V2X协同感知通过克服单智能体系统的远距离和遮挡限制来提升感知能力,但面临操作挑战和未明错误原因,我们的研究揭示了关键风险与性能发现,以促进系统设计的优化。
English: V2X cooperative perception enhances single-agent systems by overcoming distance and occlusion limitations, yet it faces operational challenges and unexplored error causes, with our study identifying key risks and performance findings to improve system design.

Authors:Yinglong Zou, Juan Zhai, Chunrong Fang, Zhenyu Chen
Title: ThermalGuardian: Temperature-Aware Testing of Automotive Deep Learning Frameworks
Abstract:
Deep learning models play a vital role in autonomous driving systems, supporting critical functions such as environmental perception. To accelerate model inference, these deep learning models' deployment relies on automotive deep learning frameworks, for example, PaddleInference in Apollo and TensorRT in AutoWare. However, unlike deploying deep learning models on the cloud, vehicular environments experience extreme ambient temperatures varying from -40°C to 50°C, significantly impacting GPU temperature. Additionally, heats generated when computing further lead to the GPU temperature increase. These temperature fluctuations lead to dynamic GPU frequency adjustments through mechanisms such as DVFS. However, automotive deep learning frameworks are designed without considering the impact of temperature-induced frequency variations. When deployed on temperature-varying GPUs, these frameworks suffer critical quality issues: compute-intensive operators face delays or errors, high/mixed-precision operators suffer from precision errors, and time-series operators suffer from synchronization issues. The above quality issues cannot be detected by existing deep learning framework testing methods because they ignore temperature's effect on the deep learning framework quality. To bridge this gap, we propose ThermalGuardian, the first automotive deep learning framework testing method under temperature-varying environments. Specifically, ThermalGuardian generates test input models using model mutation rules targeting temperature-sensitive operators, simulates GPU temperature fluctuations based on Newton's law of cooling, and controls GPU frequency based on real-time GPU temperature.
中文摘要:深度学习模型在自动驾驶中至关重要,但在车载环境中因温度波动影响GPU性能而出现质量问题,为此开发了首个应对温度变化挑战的测试方法ThermalGuardian。
English Summary: Deep learning models are crucial for autonomous driving but face quality issues when deployed in vehicles due to temperature fluctuations affecting GPU performance, prompting the development of ThermalGuardian as the first testing method to address these temperature-induced challenges.

Authors:Yinglong Zou, Juan Zhai, Chunrong Fang, Zhenyu Chen
Title: GPU Temperature Simulation-Based Testing for In-Vehicle Deep Learning Frameworks
Abstract:
Deep learning models play a vital role in autonomous driving systems, supporting critical functions such as environmental perception. To accelerate model inference, these deep learning models' deployment relies on automotive deep learning frameworks, for example, PaddleInference in Apollo and TensorRT in AutoWare. However, unlike deploying deep learning models on the cloud, vehicular environments experience extreme ambient temperatures varying from -40°C to 50°C, significantly impacting GPU temperature. Additionally, heats generated when computing further lead to the GPU temperature increase. These temperature fluctuations lead to dynamic GPU frequency adjustments through mechanisms such as DVFS. However, automotive deep learning frameworks are designed without considering the impact of temperature-induced frequency variations. When deployed on temperature-varying GPUs, these frameworks suffer critical quality issues: compute-intensive operators face delays or errors, high/mixed-precision operators suffer from precision errors, and time-series operators suffer from synchronization issues. The above quality issues cannot be detected by existing deep learning framework testing methods because they ignore temperature's effect on the deep learning framework quality. To bridge this gap, we propose ThermalGuardian, the first automotive deep learning framework testing method under temperature-varying environments. Specifically, ThermalGuardian generates test input models using model mutation rules targeting temperature-sensitive operators, simulates GPU temperature fluctuations based on Newton's law of cooling, and controls GPU frequency based on real-time GPU temperature.
中文摘要:深度学习模型在自动驾驶中至关重要,但在车载环境中因温度波动影响GPU性能而出现质量问题,为此开发了首个应对温度变化挑战的测试方法ThermalGuardian。
English Summary: Deep learning models are crucial for autonomous driving but face quality issues when deployed in vehicles due to temperature fluctuations affecting GPU performance, prompting the development of ThermalGuardian as the first testing method to address these temperature-induced challenges.

Authors:Zeyu Chen, Wen Chen, Jun Li, Qingqing Wu, Ming Ding, Xuefeng Han, Xiumei Deng, Liwei Wang
Title: Hierarchical Federated Learning for Social Network with Mobility
Abstract:
Federated Learning (FL) offers a decentralized solution that allows collaborative local model training and global aggregation, thereby protecting data privacy. In conventional FL frameworks, data privacy is typically preserved under the assumption that local data remains absolutely private, whereas the mobility of clients is frequently neglected in explicit modeling. In this paper, we propose a hierarchical federated learning framework based on the social network with mobility namely HFL-SNM that considers both data sharing among clients and their mobility patterns. Under the constraints of limited resources, we formulate a joint optimization problem of resource allocation and client scheduling, which objective is to minimize the energy consumption of clients during the FL process. In social network, we introduce the concepts of Effective Data Coverage Rate and Redundant Data Coverage Rate. We analyze the impact of effective data and redundant data on the model performance through preliminary experiments. We decouple the optimization problem into multiple sub-problems, analyze them based on preliminary experimental results, and propose Dynamic Optimization in Social Network with Mobility (DO-SNM) algorithm. Experimental results demonstrate that our algorithm achieves superior model performance while significantly reducing energy consumption, compared to traditional baseline algorithms.
中文: 本文提出HFL-SNM分层联邦学习框架,通过结合社交网络和客户端移动性来优化资源分配与调度,相比传统方法在显著降低能耗的同时获得了更优的模型性能。
English: This paper introduces HFL-SNM, a hierarchical federated learning framework that integrates social networks and client mobility to optimize resource allocation and scheduling, achieving higher model accuracy with lower energy consumption than traditional methods.

Authors:Xuefeng Han, Wen Chen, Jun Li, Ming Ding, Qingqing Wu, Kang Wei, Xiumei Deng, Yumeng Shao, Qiong Wu
Title: Analysis and Optimization of Wireless Multimodal Federated Learning on Modal Heterogeneity
Abstract:
Multimodal federated learning (MFL) is a distributed framework for training multimodal models without uploading local multimodal data of clients, thereby effectively protecting client privacy. However, multimodal data is commonly heterogeneous across diverse clients, where each client possesses only a subset of all modalities, renders conventional analysis results and optimization methods in unimodal federated learning inapplicable. In addition, fixed latency demand and limited communication bandwidth pose significant challenges for deploying MFL in wireless scenarios. To optimize the wireless MFL performance on modal heterogeneity, this paper proposes a joint client scheduling and bandwidth allocation (JCSBA) algorithm based on a decision-level fusion architecture with adding a unimodal loss function. Specifically, with the decision results, the unimodal loss functions are added to both the training objective and local update loss functions to accelerate multimodal convergence and improve unimodal performance. To characterize MFL performance, we derive a closed-form upper bound related to client and modality scheduling and minimize the derived bound under the latency, energy, and bandwidth constraints through JCSBA. Experimental results on multimodal datasets demonstrate that the JCSBA algorithm improves the multimodal accuracy and the unimodal accuracy by 4.06% and 2.73%, respectively, compared to conventional algorithms.
Chinese: 本文针对无线多模态联邦学习中的模态异构和资源限制问题,提出了一种联合客户端调度与带宽分配算法,将多模态和单模态准确率分别提升了4.06%和2.73%。
English: This paper introduces a joint client scheduling and bandwidth allocation algorithm for wireless multimodal federated learning to address modal heterogeneity and resource constraints, enhancing multimodal and unimodal accuracy by 4.06% and 2.73%, respectively.

Authors:Yifan Jiang, Qingqing Wu, Hongxun Hui, Wen Chen, Derrick Wing Kwan Ng
Title: Low-Altitude UAV Tracking via Sensing-Assisted Predictive Beamforming
Abstract:
Sensing-assisted predictive beamforming, as one of the enabling technologies for emerging integrated sensing and communication (ISAC) paradigm, shows significant promise for enhancing various future unmanned aerial vehicle (UAV) applications. However, current works predominately emphasized on spectral efficiency enhancement, while the impact of such beamforming techniques on the communication reliability was largely unexplored and challenging to characterize. To fill this research gap and tackle this issue, this paper investigates outage capacity maximization for UAV tracking under the sensing-assisted predictive beamforming scheme. Specifically, a cellular-connected UAV tracking scheme is proposed leveraging extended Kalman filtering (EKF), where the predicted UAV trajectory, sensing duration ratio, and target constant received signal-to-noise ratio (SNR) are jointly optimized to maximize the outage capacity at each time slot. To address the implicit nature of the objective function, closed-form approximations of the outage probabilities (OPs) at both prediction and measurement stages of each time slot are proposed based on second-order Taylor expansions, providing an efficient and full characterization of outage capacity. Subsequently, an efficient algorithm is proposed based on a combination of bisection search and successive convex approximation (SCA) to address the non-convex optimization problem with guaranteed convergence. To further reduce computational complexity, a second efficient algorithm is developed based on alternating optimization (AO). Simulation results validate the accuracy of the derived OP approximations, the effectiveness of the proposed algorithms, and the significant outage capacity enhancement over various benchmarks, while also indicating a trade-off between decreasing path loss and enjoying wide beam coverage for outage capacity maximization.
中文: 本文提出了一种面向蜂窝连接无人机追踪的感知辅助预测波束成形方案,通过优化轨迹和参数,采用高效算法最大化中断容量,并经仿真验证其显著性能提升。
English: This paper proposes a sensing-assisted predictive beamforming scheme for cellular-connected UAV tracking, optimizing trajectory and parameters to maximize outage capacity through efficient algorithms and validated by simulations.

Authors:Mingyi Luo, Ruichen Zhang, Xiangwang Hou, Jun Du, Chunxiao Jiang, Yong Ren, Dusit Niyato, Shiwen Mao
Title: Agentic AI Reasoning for Mobile Edge General Intelligence: Fundamentals, Approaches, and Directions
Abstract:
The rapid advancement of large language models (LLMs) has enabled an emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. This integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we review methods that enhance LLM reasoning capabilities, such as Chain-of-Thought (CoT) prompting, Supervised Fine-Tuning (SFT), and Mixture of Experts (MoE). Next, we present a distributed framework that addresses two correlated aspects: reasoning enhancement through adaptive CoT prompting and scalable deployment through distributed MoE architecture. The framework dynamically activates expert networks and adjusts reasoning depth based on task complexity and device capabilities. We further conduct experimental evaluations in mobile edge environments. Experimental results demonstrate the framework's effectiveness in balancing reasoning quality with resource efficiency, validating the practical viability of deploying sophisticated LLM reasoning capabilities in resource-constrained MEGI environments.
中文摘要:移动边缘通用智能(MEGI)面临大语言模型代理AI的算力挑战,本文提出融合自适应推理增强与分布式部署的联合优化框架,在资源受限环境中实现性能与效率的平衡。
English Summary: The emergence of Mobile Edge General Intelligence (MEGI) faces computational challenges from LLM-based agentic AI, which are addressed through a joint optimization framework combining adaptive reasoning enhancement and distributed deployment to balance performance with resource constraints.

Authors:Xiao Chi, Wenlin Zhong, Yiquan Wu, Wei Wang, Kun Kuang, Fei Wu, Minghui Xiong
Title: Universal Legal Article Prediction via Tight Collaboration between Supervised Classification Model and LLM
Abstract:
Legal Article Prediction (LAP) is a critical task in legal text classification, leveraging natural language processing (NLP) techniques to automatically predict relevant legal articles based on the fact descriptions of cases. As a foundational step in legal decision-making, LAP plays a pivotal role in determining subsequent judgments, such as charges and penalties. Despite its importance, existing methods face significant challenges in addressing the complexities of LAP. Supervised classification models (SCMs), such as CNN and BERT, struggle to fully capture intricate fact patterns due to their inherent limitations. Conversely, large language models (LLMs), while excelling in generative tasks, perform suboptimally in predictive scenarios due to the abstract and ID-based nature of legal articles. Furthermore, the diversity of legal systems across jurisdictions exacerbates the issue, as most approaches are tailored to specific countries and lack broader applicability. To address these limitations, we propose Uni-LAP, a universal framework for legal article prediction that integrates the strengths of SCMs and LLMs through tight collaboration. Specifically, in Uni-LAP, the SCM is enhanced with a novel Top-K loss function to generate accurate candidate articles, while the LLM employs syllogism-inspired reasoning to refine the final predictions. We evaluated Uni-LAP on datasets from multiple jurisdictions, and empirical results demonstrate that our approach consistently outperforms existing baselines, showcasing its effectiveness and generalizability.
中文: Uni-LAP是一个通用框架,通过整合监督分类模型和大语言模型的优势,提升了法律条文预测的准确性和泛化能力,并在多司法管辖区数据集上验证了其有效性。
English: Uni-LAP is a universal framework that combines supervised classification models and large language models to enhance legal article prediction, overcoming the limitations of existing methods and demonstrating superior performance across diverse legal systems.

Authors:Shenghai Yuan, Weixiang Guo, Tianxin Hu, Yu Yang, Jinyu Chen, Rui Qian, Zhongyuan Liu, Lihua Xie
Title: STARC: See-Through-Wall Augmented Reality Framework for Human-Robot Collaboration in Emergency Response
Abstract:
In emergency response missions, first responders must navigate cluttered indoor environments where occlusions block direct line-of-sight, concealing both life-threatening hazards and victims in need of rescue. We present STARC, a see-through AR framework for human-robot collaboration that fuses mobile-robot mapping with responder-mounted LiDAR sensing. A ground robot running LiDAR-inertial odometry performs large-area exploration and 3D human detection, while helmet- or handheld-mounted LiDAR on the responder is registered to the robot's global map via relative pose estimation. This cross-LiDAR alignment enables consistent first-person projection of detected humans and their point clouds - rendered in AR with low latency - into the responder's view. By providing real-time visualization of hidden occupants and hazards, STARC enhances situational awareness and reduces operator risk. Experiments in simulation, lab setups, and tactical field trials confirm robust pose alignment, reliable detections, and stable overlays, underscoring the potential of our system for fire-fighting, disaster relief, and other safety-critical operations. Code and design will be open-sourced upon acceptance.
中文摘要:STARC增强现实框架通过融合机器人与救援人员的激光雷达数据,将隐藏的危险源和受困者的实时三维投影呈现在救援人员视野中,有效提升应急救援的态势感知能力。
English Summary: STARC is an augmented reality framework that enhances first responders' situational awareness by combining robot and responder LiDAR data to project real-time visualizations of hidden hazards and victims directly into their view.

Authors:Tianxin Hu, Weixiang Guo, Ruimeng Liu, Xinhang Xu, Rui Qian, Jinyu Chen, Shenghai Yuan, Lihua Xie
Title: Energy-Constrained Navigation for Planetary Rovers under Hybrid RTG-Solar Power
Abstract:
Future planetary exploration rovers must operate for extended durations on hybrid power inputs that combine steady radioisotope thermoelectric generator (RTG) output with variable solar photovoltaic (PV) availability. While energy-aware planning has been studied for aerial and underwater robots under battery limits, few works for ground rovers explicitly model power flow or enforce instantaneous power constraints. Classical terrain-aware planners emphasize slope or traversability, and trajectory optimization methods typically focus on geometric smoothness and dynamic feasibility, neglecting energy feasibility. We present an energy-constrained trajectory planning framework that explicitly integrates physics-based models of translational, rotational, and resistive power with baseline subsystem loads, under hybrid RTG-solar input. By incorporating both cumulative energy budgets and instantaneous power constraints into SE(2)-based polynomial trajectory optimization, the method ensures trajectories that are simultaneously smooth, dynamically feasible, and power-compliant. Simulation results on lunar-like terrain show that our planner generates trajectories with peak power within 0.55 percent of the prescribed limit, while existing methods exceed limits by over 17 percent. This demonstrates a principled and practical approach to energy-aware autonomy for long-duration planetary missions.
中文: 本研究提出了一种能量约束的行星漫游车轨迹规划框架,将同位素热电-太阳能混合动力模型与瞬时及累积能量约束相结合,确保轨迹功率控制在极限值的0.55%范围内,相比传统方法的17%超标表现显著提升。
English: This study introduces an energy-constrained trajectory planning framework for planetary rovers that integrates hybrid RTG-solar power models with instantaneous and cumulative energy constraints, ensuring trajectories remain within 0.55% of power limits while outperforming conventional methods by over 17% in compliance.

Authors:Shenghai Yuan, Jason Wai Hao Yee, Weixiang Guo, Zhongyuan Liu, Thien-Minh Nguyen, Lihua Xie
Title: PERAL: Perception-Aware Motion Control for Passive LiDAR Excitation in Spherical Robots
Abstract:
Autonomous mobile robots increasingly rely on LiDAR-IMU odometry for navigation and mapping, yet horizontally mounted LiDARs such as the MID360 capture few near-ground returns, limiting terrain awareness and degrading performance in feature-scarce environments. Prior solutions - static tilt, active rotation, or high-density sensors - either sacrifice horizontal perception or incur added actuators, cost, and power. We introduce PERAL, a perception-aware motion control framework for spherical robots that achieves passive LiDAR excitation without dedicated hardware. By modeling the coupling between internal differential-drive actuation and sensor attitude, PERAL superimposes bounded, non-periodic oscillations onto nominal goal- or trajectory-tracking commands, enriching vertical scan diversity while preserving navigation accuracy. Implemented on a compact spherical robot, PERAL is validated across laboratory, corridor, and tactical environments. Experiments demonstrate up to 96 percent map completeness, a 27 percent reduction in trajectory tracking error, and robust near-ground human detection, all at lower weight, power, and cost compared with static tilt, active rotation, and fixed horizontal baselines. The design and code will be open-sourced upon acceptance.
中文摘要:PERAL是一种面向球形机器人的感知意识运动控制框架,通过内部驱动实现被动激光雷达激励,无需额外硬件即可提升垂直扫描多样性,显著改善地形测绘与导航性能。
English Summary: PERAL is a perception-aware motion control framework for spherical robots that passively enhances LiDAR scan diversity through internal actuation, improving terrain mapping and navigation without extra hardware.

Authors:Jianping Li, Xinhang Xu, Zhongyuan Liu, Shenghai Yuan, Muqing Cao, Lihua Xie
Title: AEOS: Active Environment-aware Optimal Scanning Control for UAV LiDAR-Inertial Odometry in Complex Scenes
Abstract:
LiDAR-based 3D perception and localization on unmanned aerial vehicles (UAVs) are fundamentally limited by the narrow field of view (FoV) of compact LiDAR sensors and the payload constraints that preclude multi-sensor configurations. Traditional motorized scanning systems with fixed-speed rotations lack scene awareness and task-level adaptability, leading to degraded odometry and mapping performance in complex, occluded environments. Inspired by the active sensing behavior of owls, we propose AEOS (Active Environment-aware Optimal Scanning), a biologically inspired and computationally efficient framework for adaptive LiDAR control in UAV-based LiDAR-Inertial Odometry (LIO). AEOS combines model predictive control (MPC) and reinforcement learning (RL) in a hybrid architecture: an analytical uncertainty model predicts future pose observability for exploitation, while a lightweight neural network learns an implicit cost map from panoramic depth representations to guide exploration. To support scalable training and generalization, we develop a point cloud-based simulation environment with real-world LiDAR maps across diverse scenes, enabling sim-to-real transfer. Extensive experiments in both simulation and real-world environments demonstrate that AEOS significantly improves odometry accuracy compared to fixed-rate, optimization-only, and fully learned baselines, while maintaining real-time performance under onboard computational constraints. The project page can be found at https://kafeiyin00.github.io/AEOS/.
中文:AEOS框架结合模型预测控制和强化学习,自适应地调控无人机上的激光雷达扫描,在复杂环境中显著提升里程计精度的同时保持实时性能。
English: The AEOS framework combines model predictive control and reinforcement learning to adaptively control LiDAR scanning on UAVs, significantly enhancing odometry accuracy in complex environments while maintaining real-time performance.

Authors:Zheqi Lv, Wenqiao Zhang, Kairui Fu, Qi Tian, Shengyu Zhang, Jiajie Su, Jingyuan Chen, Kun Kuang, Fei Wu
Title: Tackling Device Data Distribution Real-time Shift via Prototype-based Parameter Editing
Abstract:
The on-device real-time data distribution shift on devices challenges the generalization of lightweight on-device models. This critical issue is often overlooked in current research, which predominantly relies on data-intensive and computationally expensive fine-tuning approaches. To tackle this, we introduce Persona, a novel personalized method using a prototype-based, backpropagation-free parameter editing framework to enhance model generalization without post-deployment retraining. Persona employs a neural adapter in the cloud to generate a parameter editing matrix based on real-time device data. This matrix adeptly adapts on-device models to the prevailing data distributions, efficiently clustering them into prototype models. The prototypes are dynamically refined via the parameter editing matrix, facilitating efficient evolution. Furthermore, the integration of cross-layer knowledge transfer ensures consistent and context-aware multi-layer parameter changes and prototype assignment. Extensive experiments on vision task and recommendation task on multiple datasets confirm Persona's effectiveness and generality.
中文: Persona是一种创新方法,通过云端神经适配器生成参数编辑矩阵,无需重新训练即可让设备端模型适应实时数据变化,并将其聚类为动态优化的原型模型。
English: Persona is a novel method that uses a cloud-based neural adapter to generate a parameter editing matrix, enabling on-device models to adapt to real-time data shifts without retraining by clustering them into dynamically refined prototypes.

Authors:Yanwei Gong, Ruichen Zhang, Xiaoqing Wang, Xiaolin Chang, Bo Ai, Junchao Fan, Bocheng Ju, Dusit Niyato
Title: Towards Reliable Service Provisioning for Dynamic UAV Clusters in Low-Altitude Economy Networks
Abstract:
Unmanned Aerial Vehicle (UAV) cluster services are crucial for promoting the low-altitude economy by enabling scalable, flexible, and adaptive aerial networks. To meet diverse service demands, clusters must dynamically incorporate a New UAVs (NUAVs) or an Existing UAV (EUAV). However, achieving sustained service reliability remains challenging due to the need for efficient and scalable NUAV authentication, privacy-preserving cross-cluster authentication for EUAVs, and robust protection of the cluster session key, including both forward and backward secrecy. To address these challenges, we propose a Lightweight and Privacy-Preserving Cluster Authentication and Session Key Update (LP2-CASKU) scheme tailored for dynamic UAV clusters in low-altitude economy networks. LP2-CASKU integrates an efficient batch authentication mechanism that simultaneously authenticates multiple NUAVs with minimal communication overhead. It further introduces a lightweight cross-cluster authentication mechanism that ensures EUAV anonymity and unlinkability. Additionally, a secure session key update mechanism is incorporated to maintain key confidentiality over time, thereby preserving both forward and backward secrecy. We provide a comprehensive security analysis and evaluate LP2-CASKU performance through both theoretical analysis and OMNeT++ simulations. Experimental results demonstrate that, compared to the baseline, LP2-CASKU achieves a latency reduction of 82.8%-90.8% by across different UAV swarm configurations and network bitrates, demonstrating strong adaptability to dynamic communication environments. Besides, under varying UAV swarm configurations, LP2-CASKU reduces the energy consumption by approximately 37.6-72.6%, while effectively supporting privacy-preserving authentication in highly dynamic UAV cluster environments.
中文: LP2-CASKU方案为动态无人机集群提供轻量级隐私保护认证和安全会话密钥更新,在确保前向与后向保密性的同时,显著降低了通信延迟和能耗。
English: The LP2-CASKU scheme provides lightweight, privacy-preserving authentication and secure session key updates for dynamic UAV clusters, significantly reducing latency and energy consumption while ensuring forward and backward secrecy.

Authors:Bisheng Wei, Ruihong Jiang, Ruichen Zhang, Yinqiu Liu, Dusit Niyato, Yaohua Sun, Yang Lu, Yonghui Li, Shiwen Mao, Chau Yuen, Marco Di Renzo, Mugen Peng
Title: Large Language Models for Next-Generation Wireless Network Management: A Survey and Tutorial
Abstract:
The rapid advancement toward sixth-generation (6G) wireless networks has significantly intensified the complexity and scale of optimization problems, including resource allocation and trajectory design, often formulated as combinatorial problems in large discrete decision spaces. However, traditional optimization methods, such as heuristics and deep reinforcement learning (DRL), struggle to meet the demanding requirements of real-time adaptability, scalability, and dynamic handling of user intents in increasingly heterogeneous and resource-constrained network environments. Large language models (LLMs) present a transformative paradigm by enabling natural language-driven problem formulation, context-aware reasoning, and adaptive solution refinement through advanced semantic understanding and structured reasoning capabilities. This paper provides a systematic and comprehensive survey of LLM-enabled optimization frameworks tailored for wireless networks. We first introduce foundational design concepts and distinguish LLM-enabled methods from conventional optimization paradigms. Subsequently, we critically analyze key enabling methodologies, including natural language modeling, solver collaboration, and solution verification processes. Moreover, we explore representative case studies to demonstrate LLMs' transformative potential in practical scenarios such as optimization formulation, low-altitude economy networking, and intent networking. Finally, we discuss current research challenges, examine prominent open-source frameworks and datasets, and identify promising future directions to facilitate robust, scalable, and trustworthy LLM-enabled optimization solutions for next-generation wireless networks.
中文: 向6G网络的过渡加剧了优化挑战,而大语言模型通过自然语言处理和语义推理提供解决方案,本文系统综述了其应用于无线网络优化的框架、方法和前景。
English: The transition to 6G networks has amplified optimization challenges, which large language models address through natural language processing and semantic reasoning, offering a comprehensive framework for wireless network optimization as surveyed in this paper.

Authors:Jinluan Yang, Ruihao Zhang, Zhengyu Chen, Fei Wu, Kun Kuang
Title: Unifying Adversarial Perturbation for Graph Neural Networks
Abstract:
This paper studies the vulnerability of Graph Neural Networks (GNNs) to adversarial attacks on node features and graph structure. Various methods have implemented adversarial training to augment graph data, aiming to bolster the robustness and generalization of GNNs. These methods typically involve applying perturbations to the node feature, weights, or graph structure and subsequently minimizing the loss by learning more robust graph model parameters under the adversarial perturbations. Despite the effectiveness of adversarial training in enhancing GNNs' robustness and generalization abilities, its application has been largely confined to specific datasets and GNN types. In this paper, we propose a novel method, PerturbEmbedding, that integrates adversarial perturbation and training, enhancing GNNs' resilience to such attacks and improving their generalization ability. PerturbEmbedding performs perturbation operations directly on every hidden embedding of GNNs and provides a unified framework for most existing perturbation strategies/methods. We also offer a unified perspective on the forms of perturbations, namely random and adversarial perturbations. Through experiments on various datasets using different backbone models, we demonstrate that PerturbEmbedding significantly improves both the robustness and generalization abilities of GNNs, outperforming existing methods. The rejection of both random (non-targeted) and adversarial (targeted) perturbations further enhances the backbone model's performance.
中文: 本文提出PerturbEmbedding新方法,通过直接在隐藏嵌入层施加对抗性扰动来增强图神经网络的鲁棒性和泛化能力,在多种数据集和模型上均优于现有方法。
English: This paper introduces PerturbEmbedding, a novel method that enhances Graph Neural Networks' robustness and generalization by applying adversarial perturbations directly to hidden embeddings, outperforming existing techniques across various datasets and models.

Authors:Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang, Zhirong Wu, Qi Dai, Bei Liu, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Yuan Zhang, Xin Li, Zhaoyi Liu, Xin Geng, Baining Guo
Title: InfoAgent: Advancing Autonomous Information-Seeking Agents
Abstract:
Building Large Language Model agents that expand their capabilities by interacting with external tools represents a new frontier in AI research and applications. In this paper, we introduce InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and orchestrated web search tools. To construct challenging, hard-to-find queries,we build entity trees and apply sub-tree sampling with entity fuzzification to systematically increase question difficulty. Unlike prior work that relies heavily on commercial search tools, we develop a dedicated self-hosted search infrastructure, enhancing transparency of agent environments and facilitating further advancement of agent capacity. We evaluate the effectiveness of our data pipeline by measuring the average number of tool calls required to correctly answer a question, and also show that our agent yields better performance when equipped with our tools. Our \mbox{InfoAgent} is post-trained from Qwen3-14B using a two-stage recipe: cold-start supervised finetuning to instill long-horizon search behaviors, followed by reinforcement learning which significantly improves reasoning-driven tool use. With our methods, InfoAgent achieves 15.3\% accuracy on BrowseComp, 29.2\% on BrowseComp-ZH, and 40.4\% on Xbench-DS, outperforming prior open-source deep research agents such as WebSailor-72B and DeepDive-32B.
中文摘要:InfoAgent是一种通过创新的数据合成流程和自主托管的网络搜索工具来增强能力的大型语言模型智能体,在多项基准测试中优于之前的开源深度研究智能体。
English Summary: InfoAgent is a large language model agent that enhances its capabilities through a novel data synthesis pipeline and self-hosted web search tools, achieving superior performance on various benchmarks compared to previous open-source agents.

Authors:Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li
Title: Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting
Abstract:
In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.
中文: 本文提出了一种基于谱相干性的可预测性诊断框架,通过量化预测难度和模型效率,揭示了可预测性漂移及模型复杂性与数据可预测性之间的权衡关系等核心发现。
English: This paper introduces a predictability-aligned diagnostic framework using spectral coherence to quantify forecasting difficulty and model efficiency, revealing key insights like predictability drift and the trade-off between model complexity and data predictability.

Authors:Fanjin Meng, Yuan Yuan, Jingtao Ding, Jie Feng, Chonghua Han, Yong Li
Title: MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning
Abstract:
Mobility Foundation Models (MFMs) have advanced the modeling of human movement patterns, yet they face a ceiling due to limitations in data scale and semantic understanding. While Large Language Models (LLMs) offer powerful semantic reasoning, they lack the innate understanding of spatio-temporal statistics required for generating physically plausible mobility trajectories. To address these gaps, we propose MoveFM-R, a novel framework that unlocks the full potential of mobility foundation models by leveraging language-driven semantic reasoning capabilities. It tackles two key challenges: the vocabulary mismatch between continuous geographic coordinates and discrete language tokens, and the representation gap between the latent vectors of MFMs and the semantic world of LLMs. MoveFM-R is built on three core innovations: a semantically enhanced location encoding to bridge the geography-language gap, a progressive curriculum to align the LLM's reasoning with mobility patterns, and an interactive self-reflection mechanism for conditional trajectory generation. Extensive experiments demonstrate that MoveFM-R significantly outperforms existing MFM-based and LLM-based baselines. It also shows robust generalization in zero-shot settings and excels at generating realistic trajectories from natural language instructions. By synthesizing the statistical power of MFMs with the deep semantic understanding of LLMs, MoveFM-R pioneers a new paradigm that enables a more comprehensive, interpretable, and powerful modeling of human mobility. The implementation of MoveFM-R is available online at https://anonymous.4open.science/r/MoveFM-R-CDE7/.
Chinese: MoveFM-R 是一种创新框架,通过融合语义推理与时空数据,弥合了移动基础模型与大型语言模型之间的鸿沟,能够根据自然语言指令生成更准确、可解释的人类移动轨迹。
English: MoveFM-R is a novel framework that bridges the gap between mobility foundation models and large language models by integrating semantic reasoning with spatio-temporal data, enabling more accurate and interpretable human mobility trajectory generation from natural language instructions.

Authors:Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, Jingbo Zhu
Title: FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction
Abstract:
Full-Duplex Speech-to-Speech Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling real-time spoken dialogue systems. However, benchmarking and modeling these models remains a fundamental challenge. We introduce FLEXI, the first benchmark for full-duplex LLM-human spoken interaction that explicitly incorporates model interruption in emergency scenarios. FLEXI systematically evaluates the latency, quality, and conversational effectiveness of real-time dialogue through six diverse human-LLM interaction scenarios, revealing significant gaps between open source and commercial models in emergency awareness, turn terminating, and interaction latency. Finally, we suggest that next token-pair prediction offers a promising path toward achieving truly seamless and human-like full-duplex interaction.
中文摘要:FLEXI作为首个全双工语音交互基准,通过六种人机交互场景系统评估实时对话的延迟、质量和会话效果,揭示了开源与商业模型在紧急情况感知和交互延迟方面的显著差距。
English Summary: FLEXI is introduced as the first benchmark for full-duplex speech-to-speech LLMs, evaluating latency, quality, and conversational effectiveness across six scenarios while highlighting gaps in emergency awareness and interaction latency between models.

Authors:Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li
Title: ChaosNexus: A Foundation Model for Universal Chaotic System Forecasting with Multi-scale Representations
Abstract:
Accurately forecasting chaotic systems, prevalent in domains such as weather prediction and fluid dynamics, remains a significant scientific challenge. The inherent sensitivity of these systems to initial conditions, coupled with a scarcity of observational data, severely constrains traditional modeling approaches. Since these models are typically trained for a specific system, they lack the generalization capacity necessary for real-world applications, which demand robust zero-shot or few-shot forecasting on novel or data-limited scenarios. To overcome this generalization barrier, we propose ChaosNexus, a foundation model pre-trained on a diverse corpus of chaotic dynamics. ChaosNexus employs a novel multi-scale architecture named ScaleFormer augmented with Mixture-of-Experts layers, to capture both universal patterns and system-specific behaviors. The model demonstrates state-of-the-art zero-shot generalization across both synthetic and real-world benchmarks. On a large-scale testbed comprising over 9,000 synthetic chaotic systems, it improves the fidelity of long-term attractor statistics by more than 40% compared to the leading baseline. This robust performance extends to real-world applications with exceptional data efficiency. For instance, in 5-day global weather forecasting, ChaosNexus achieves a competitive zero-shot mean error below 1 degree, a result that further improves with few-shot fine-tuning. Moreover, experiments on the scaling behavior of ChaosNexus provide a guiding principle for scientific foundation models: cross-system generalization stems from the diversity of training systems, rather than sheer data volume.
Chinese: ChaosNexus作为一种具有创新多尺度架构的基础模型,通过卓越的零样本泛化能力和数据效率,突破了混沌系统预测中的泛化障碍,在合成与真实场景应用中均实现了领先性能。
English: ChaosNexus, a foundation model with a novel multi-scale architecture, overcomes generalization barriers in chaotic system forecasting by achieving state-of-the-art zero-shot performance and superior data efficiency across synthetic and real-world applications.

Authors:Zihan Yu, Guanren Wang, Jingtao Ding, Huandong Wang, Yong Li
Title: Beyond Formula Complexity: Effective Information Criterion Improves Performance and Interpretability for Symbolic Regression
Abstract:
Symbolic regression discovers accurate and interpretable formulas to describe given data, thereby providing scientific insights for domain experts and promoting scientific discovery. However, existing symbolic regression methods often use complexity metrics as a proxy for interoperability, which only considers the size of the formula but ignores its internal mathematical structure. Therefore, while they can discover formulas with compact forms, the discovered formulas often have structures that are difficult to analyze or interpret mathematically. In this work, inspired by the observation that physical formulas are typically numerically stable under limited calculation precision, we propose the Effective Information Criterion (EIC). It treats formulas as information processing systems with specific internal structures and identifies the unreasonable structure in them by the loss of significant digits or the amplification of rounding noise as data flows through the system. We find that this criterion reveals the gap between the structural rationality of models discovered by existing symbolic regression algorithms and real-world physical formulas. Combining EIC with various search-based symbolic regression algorithms improves their performance on the Pareto frontier and reduces the irrational structure in the results. Combining EIC with generative-based algorithms reduces the number of samples required for pre-training, improving sample efficiency by 2~4 times. Finally, for different formulas with similar accuracy and complexity, EIC shows a 70.2% agreement with 108 human experts' preferences for formula interpretability, demonstrating that EIC, by measuring the unreasonable structures in formulas, actually reflects the formula's interpretability.
中文摘要:提出的有效信息准则通过检测数值不稳定性和结构不合理性来评估符号回归公式,不仅提升了算法性能,还与人类对公式可解释性的偏好高度一致。
English Summary: The proposed Effective Information Criterion (EIC) evaluates symbolic regression formulas by detecting numerical instability and structural unreasonableness, improving algorithm performance and aligning with human interpretability preferences.

Authors:Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou
Title: EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing
Abstract:
Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images -- resulting in limited coverage and inheriting biases from prior generative models -- or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline's modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.
Chinese: 本文提出了EdiVal-Agent,一个自动化、可扩展的评估框架,通过结合视觉语言模型与物体检测器,对基于指令的图像编辑进行细粒度评估,解决了当前评估方法的局限性,并显示出与人类判断更好的一致性。
English: This paper introduces EdiVal-Agent, an automated and scalable evaluation framework that integrates vision-language models with object detectors to provide fine-grained assessment of instruction-based image editing, addressing limitations in current evaluation methods and demonstrating improved alignment with human judgments.

Authors:Chenglong Wang, Yongyu Mu, Hang Zhou, Yifu Huo, Ziming Zhu, Jiali Zeng, Murun Yang, Bei Li, Tong Xiao, Xiaoyang Hao, Chunliang Zhang, Fandong Meng, Jingbo Zhu
Title: GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning
Abstract:
Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains a fundamental challenge: the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning. It can support downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
中文摘要:近年来奖励建模从任务特定设计转向通用模型取得显著进展,但仍面临依赖大规模标注数据的根本挑战;我们提出的GRAM-R²通过自训练方法利用未标注数据生成偏好标签及奖励理由,在多项任务中仅需少量微调即可实现优越性能。
English Summary: Recent advances in reward modeling have shifted towards generalist models, yet they still face challenges due to heavy reliance on labeled data; our proposed GRAM-R² addresses this by using self-training to generate both preference labels and rationales, achieving superior performance across various tasks with minimal fine-tuning.

Authors:Meet Udeshi, Venkata Sai Charan Putrevu, Prashanth Krishnamurthy, Prashant Anantharaman, Sean Carrick, Ramesh Karri, Farshad Khorrami
Title: Binary Diff Summarization using Large Language Models
Abstract:
Security of software supply chains is necessary to ensure that software updates do not contain maliciously injected code or introduce vulnerabilities that may compromise the integrity of critical infrastructure. Verifying the integrity of software updates involves binary differential analysis (binary diffing) to highlight the changes between two binary versions by incorporating binary analysis and reverse engineering. Large language models (LLMs) have been applied to binary analysis to augment traditional tools by producing natural language summaries that cybersecurity experts can grasp for further analysis. Combining LLM-based binary code summarization with binary diffing can improve the LLM's focus on critical changes and enable complex tasks such as automated malware detection. To address this, we propose a novel framework for binary diff summarization using LLMs. We introduce a novel functional sensitivity score (FSS) that helps with automated triage of sensitive binary functions for downstream detection tasks. We create a software supply chain security benchmark by injecting 3 different malware into 6 open-source projects which generates 104 binary versions, 392 binary diffs, and 46,023 functions. On this, our framework achieves a precision of 0.98 and recall of 0.64 for malware detection, displaying high accuracy with low false positives. Across malicious and benign functions, we achieve FSS separation of 3.0 points, confirming that FSS categorization can classify sensitive functions. We conduct a case study on the real-world XZ utils supply chain attack; our framework correctly detects the injected backdoor functions with high FSS.
中文: 本文提出了一种利用大语言模型进行二进制差异摘要的新框架,通过引入功能敏感度评分来增强软件供应链中的恶意软件检测,在自定义基准测试中实现了高精度和召回率,并成功识别了XZ工具等真实案例中的后门程序。
English: This paper introduces a novel framework using large language models for binary diff summarization, incorporating a functional sensitivity score to enhance malware detection in software supply chains, achieving high precision and recall on a custom benchmark and successfully identifying backdoors in real-world cases like XZ utils.

Authors:Xiong Peng, Bo Han, Fengfei Yu, Tongliang Liu, Feng Liu, Mingyuan Zhou
Title: Generative Model Inversion Through the Lens of the Manifold Hypothesis
Abstract:
Model inversion attacks (MIAs) aim to reconstruct class-representative samples from trained models. Recent generative MIAs utilize generative adversarial networks to learn image priors that guide the inversion process, yielding reconstructions with high visual quality and strong fidelity to the private training data. To explore the reason behind their effectiveness, we begin by examining the gradients of inversion loss with respect to synthetic inputs, and find that these gradients are surprisingly noisy. Further analysis reveals that generative inversion implicitly denoises these gradients by projecting them onto the tangent space of the generator manifold, filtering out off-manifold components while preserving informative directions aligned with the manifold. Our empirical measurements show that, in models trained with standard supervision, loss gradients often exhibit large angular deviations from the data manifold, indicating poor alignment with class-relevant directions. This observation motivates our central hypothesis: models become more vulnerable to MIAs when their loss gradients align more closely with the generator manifold. We validate this hypothesis by designing a novel training objective that explicitly promotes such alignment. Building on this insight, we further introduce a training-free approach to enhance gradient-manifold alignment during inversion, leading to consistent improvements over state-of-the-art generative MIAs.
中文: 生成式模型反演攻击通过将噪声梯度投影到生成器流形上实现隐式去噪,从而有效重构私有训练数据;当模型梯度与流形对齐程度更高时,攻击效果显著增强,这为开发更高效的反演方法提供了新思路。
English: Generative model inversion attacks effectively reconstruct private training data by implicitly denoising noisy gradients through projection onto the generator manifold, and their success is enhanced when model gradients align closely with this manifold, leading to improved attack methods.

Authors:Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
Title: GenExam: A Multidisciplinary Text-to-Image Exam
Abstract:
Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate knowledge, reasoning, and generation, providing insights on the path to general AGI.
中文摘要:GenExam是首个多学科文本到图像的考试基准,通过考试式提示严格评估AI模型的理解、推理和生成综合能力,实验表明即使最先进模型也面临巨大挑战,为通用人工智能发展提供重要参考。
English Summary: GenExam is a pioneering multidisciplinary text-to-image benchmark that rigorously evaluates AI models' integrated understanding, reasoning, and generation capabilities through exam-style prompts, revealing significant performance gaps even in state-of-the-art models.

Authors:Jie Jiang, Siqi Shen, Haining Xie, Yang Li, Yu Shen, Danqing Huang, Bo Qian, Yinjun Wu, Wentao Zhang, Bin Cui, Peng Chen
Title: SQLGovernor: An LLM-powered SQL Toolkit for Real World Application
Abstract:
SQL queries in real world analytical environments, whether written by humans or generated automatically often suffer from syntax errors, inefficiency, or semantic misalignment, especially in complex OLAP scenarios. To address these challenges, we propose SQLGovernor, an LLM powered SQL toolkit that unifies multiple functionalities, including syntax correction, query rewriting, query modification, and consistency verification within a structured framework enhanced by knowledge management. SQLGovernor introduces a fragment wise processing strategy to enable fine grained rewriting and localized error correction, significantly reducing the cognitive load on the LLM. It further incorporates a hybrid self learning mechanism guided by expert feedback, allowing the system to continuously improve through DBMS output analysis and rule validation. Experiments on benchmarks such as BIRD and BIRD CRITIC, as well as industrial datasets, show that SQLGovernor consistently boosts the performance of base models by up to 10%, while minimizing reliance on manual expertise. Deployed in production environments, SQLGovernor demonstrates strong practical utility and effective performance.
中文: SQLGovernor 是一款基于大语言模型的 SQL 工具包,通过语法修正、查询重写与一致性验证等功能,在真实部署中将基础模型性能提升最高达10%。
English: SQLGovernor is an LLM-powered toolkit that addresses SQL query issues through syntax correction, rewriting, and verification, improving base model performance by up to 10% in real-world deployments.

Authors:Cong Chen, Kaixiang Ji, Hao Zhong, Muzhi Zhu, Anzhou Li, Guo Gan, Ziyuan Huang, Cheng Zou, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen
Title: GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI Tasks
Abstract:
Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides dense, step-by-step feedback to guide agents. GUI-Shepherd is trained on a diverse large-scale data set of $52$k interactions that features human-annotated scores and GPT-4o generated rationales, enabling it to serve both as a reward provider for RL training and as a verifier for inference. As far as we know, we are the first to conduct a systematic study of process supervision in GUI agents, across diverse settings from online long-horizon tasks to offline single-step prediction. On the online AndroidWorld benchmark, GUI-Shepherd improves success rate by $7.7$ points via multi-turn online PPO, significantly outperforming Outcome Reward Model based competitors. When used as an inference verifier, it brings $5.1$ points improvements. The benefits generalize to the offline AndroidControl benchmark, with gains of $2.2$ points as a reward provider and $4.3$ points as a verifier. Collectively, our results establish that high-fidelity process supervision is critical for building more capable GUI agents and present a generalizable solution.
中文: GUI-Shepherd提出了一种过程奖励模型,通过提供密集的逐步反馈来解决GUI任务中的稀疏奖励问题,在在线和离线基准测试中通过强化学习和推理验证显著提高了成功率。
English: GUI-Shepherd introduces a process reward model that provides dense, step-by-step feedback to overcome sparse rewards in GUI tasks, significantly improving success rates in both online and offline benchmarks through reinforcement learning and inference verification.

Authors:Cong Chen, Ziyuan Huang, Cheng Zou, Muzhi Zhu, Kaixiang Ji, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen
Title: HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation
Abstract:
In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart by a 27.2\% improvement in rFID ($1.47 \rightarrow 1.07$). When integrated into downstream generation frameworks, it achieves a $1.38\times$ faster convergence rate and an 18.9\% boost in gFID ($16.4 \rightarrow 13.3$), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential by a sota rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce multi-scale ViT-based tokenizer in image reconstruction and image generation. We hope our findings and designs advance the ViT-based tokenizers in visual generation tasks.
Chinese: HieraTok是一种创新的多尺度视觉Transformer分词器,通过多尺度下采样和尺度因果注意力机制,在图像重建与生成任务中显著超越单尺度模型,实现了性能的大幅提升。
English: HieraTok is a novel multi-scale Vision Transformer tokenizer that enhances image reconstruction and generation by employing multi-scale downsampling and scale-causal attention, achieving significant performance improvements over single-scale models.

Authors:Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou
Title: ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
Abstract:
Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5\% over ReAct, with further gains of up to 8.2\% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3\% Pass@1 on BrowseComp-zh and 18.3\% on BrowseComp-en, surpassing existing open-source web agents.
中文摘要:ReSum范式通过定期将交互历史压缩为精简推理状态,克服了大语言模型网络代理的上下文窗口限制,并借助ReSum-GRPO训练方法实现了超越ReAct的显著性能提升。
English Summary: The ReSum paradigm overcomes context window limitations in LLM-based web agents by periodically summarizing interactions into compact reasoning states, achieving significant performance improvements over ReAct through the ReSum-GRPO training method.

Authors:Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Title: Towards General Agentic Intelligence via Environment Scaling
Abstract:
Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.
中文摘要:本研究通过自动构建多样化模拟环境并采用两阶段训练策略,开发了一个可扩展框架来提升大型语言模型的智能体能力,实验结果表明该方法显著增强了函数调用功能。
English Summary: This research introduces a scalable framework for developing advanced agentic intelligence in Large Language Models by automatically creating diverse simulated environments and employing a two-phase training strategy, with experimental results showing significant improvements in function-calling capabilities.

Authors:Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Title: Scaling Agents via Continual Pre-training
Abstract:
Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retains strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
大语言模型在代理任务中因同时学习行为与对齐而产生优化冲突,但我们提出的Agentic CPT方法构建了如AgentFounder-30B的基础模型,在多项基准测试中实现了最优性能。
Large language models struggle with agentic tasks due to optimization conflicts from learning behaviors and alignment simultaneously, but our proposed Agentic CPT method creates foundational models like AgentFounder-30B that achieve state-of-the-art performance across benchmarks.

Authors:Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Title: WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
Abstract:
Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
中文: WebResearcher提出了一种新型AI框架,通过迭代式深度研究和可扩展数据合成克服了现有方法的局限,在多个基准测试中实现最先进性能,同时显著提升了工具使用能力。
English: WebResearcher introduces a novel AI framework that overcomes limitations of existing methods through iterative deep-research and scalable data synthesis, achieving state-of-the-art performance across multiple benchmarks while enhancing tool-use capabilities.

Authors:Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Title: WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
Abstract:
Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents' performance and closing the capability gap.
Chinese: 超越人类认知极限是LLM训练的关键,像DeepResearch这样的专有系统在复杂信息搜索任务中展现出卓越能力,由此开发的WebSailor后训练方法通过生成新任务和高效算法,显著缩小了与开源智能体之间的性能差距。
English: Transcending human cognitive limits is crucial in LLM training, and proprietary systems like DeepResearch show superior abilities in complex information-seeking tasks, leading to the development of WebSailor, a post-training method that closes the capability gap with open-source agents by using novel tasks and efficient algorithms.

Authors:Yuyang Yin, HaoXiang Guo, Fangfu Liu, Mengyu Wang, Hanwen Liang, Eric Li, Yikai Wang, Xiaojie Jin, Yao Zhao, Yunchao Wei
Title: PanoWorld-X: Generating Explorable Panoramic Worlds via Sphere-Aware Video Diffusion
Abstract:
Generating a complete and explorable 360-degree visual world enables a wide range of downstream applications. While prior works have advanced the field, they remain constrained by either narrow field-of-view limitations, which hinder the synthesis of continuous and holistic scenes, or insufficient camera controllability that restricts free exploration by users or autonomous agents. To address this, we propose PanoWorld-X, a novel framework for high-fidelity and controllable panoramic video generation with diverse camera trajectories. Specifically, we first construct a large-scale dataset of panoramic video-exploration route pairs by simulating camera trajectories in virtual 3D environments via Unreal Engine. As the spherical geometry of panoramic data misaligns with the inductive priors from conventional video diffusion, we then introduce a Sphere-Aware Diffusion Transformer architecture that reprojects equirectangular features onto the spherical surface to model geometric adjacency in latent space, significantly enhancing visual fidelity and spatiotemporal continuity. Extensive experiments demonstrate that our PanoWorld-X achieves superior performance in various aspects, including motion range, control precision, and visual quality, underscoring its potential for real-world applications.
中文:PanoWorld-X是一种创新框架,通过球面感知扩散变换器和大型数据集生成高保真、可控的全景视频,克服了以往视场角限制和相机控制不足的问题。
English: PanoWorld-X is a novel framework that generates high-fidelity and controllable panoramic videos with diverse camera trajectories, overcoming previous limitations in field-of-view and camera control through a Sphere-Aware Diffusion Transformer and a large-scale dataset.

Authors:Zhixiong Zhang, Shuangrui Ding, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang
Title: 2nd Place Report of MOSEv2 Challenge 2025: Concept Guided Video Object Segmentation via SeC
Abstract:
Semi-supervised Video Object Segmentation aims to segment a specified target throughout a video sequence, initialized by a first-frame mask. Previous methods rely heavily on appearance-based pattern matching and thus exhibit limited robustness against challenges such as drastic visual changes, occlusions, and scene shifts. This failure is often attributed to a lack of high-level conceptual understanding of the target. The recently proposed Segment Concept (SeC) framework mitigated this limitation by using a Large Vision-Language Model (LVLM) to establish a deep semantic understanding of the object for more persistent segmentation. In this work, we evaluate its zero-shot performance on the challenging coMplex video Object SEgmentation v2 (MOSEv2) dataset. Without any fine-tuning on the training set, SeC achieved 39.7 \JFn on the test set and ranked 2nd place in the Complex VOS track of the 7th Large-scale Video Object Segmentation Challenge.
Chinese: SeC框架利用大型视觉语言模型建立对目标的深层语义理解,在无需微调的情况下实现了鲁棒的半监督视频对象分割,并在挑战赛中获得了第二名。
English: The SeC framework leverages a Large Vision-Language Model to build a deep semantic understanding of objects, enabling robust semi-supervised video object segmentation and achieving second place in a challenge without fine-tuning.

Authors:Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Ting Liu, Bing Qin
Title: MoGU V2: Toward a Higher Pareto Frontier Between Model Usability and Security
Abstract:
As Large Language Models (LLMs) increasingly permeate human life, their security has emerged as a critical concern, particularly their ability to maintain harmless responses to malicious instructions. Although extensive methods have improved LLMs' security, they often lead to conservative, rejection-oriented responses that compromise practical usability. This presents a key challenge: how to advance the Pareto frontier between LLMs' usability and security, rather than necessitate a trade-off between them. To address this, we propose the MoGU framework, in which the intra-layer router dynamically allocates weights by sensing hidden states, thereby balancing the contributions of security-optimized and usability-optimized variants. Despite its initial potential, the MoGU framework faces limitations such as parameter redundancy and performance bottlenecks. To overcome these, we further propose an improved MoGU_v2 framework that establishes a tighter coupling between the routers and hidden states. In MoGU_v2, routers are embedded only in layers encoding highly classifiable security features, and backbone modules are activated during router optimization to enable bidirectional adaptation. MoGU_V2 exhibits strong adaptability and stable improvements across various series of LLMs, including mainstream LLMs serving as brains in various applications, on-device LLMs optimized for resource-constrained scenarios, and reasoning LLMs tailored for user interpretability. Meanwhile, even facing risks introduced by Instruction Fine-tuning, MoGU_v2 can easily restore security without compromising the task performance gains via a simple data-mix strategy. These comprehensive improvements highlight MoGU_V2 as a robust and versatile solution for mitigating security risks in real-world applications.
中文: MoGU_v2框架通过在可分类安全特征的层级嵌入路由器和实现双向适应,有效提升了大型语言模型的实用性与安全性之间的平衡,为各类模型提供了稳健且通用的改进,且不影响性能。
English: The MoGU_v2 framework enhances the balance between usability and security in Large Language Models by embedding routers in layers with classifiable security features and enabling bidirectional adaptation, offering robust and versatile improvements across various LLM types without compromising performance.

Authors:Yanrui Du, Fenglei Fan, Sendong Zhao, Jiawei Cao, Qika Lin, Kai He, Ting Liu, Bing Qin, Mengling Feng
Title: Anchoring Refusal Direction: Mitigating Safety Risks in Tuning via Projection Constraint
Abstract:
Instruction Fine-Tuning (IFT) has been widely adopted as an effective post-training strategy to enhance various abilities of Large Language Models (LLMs). However, prior studies have shown that IFT can significantly compromise LLMs' safety, particularly their ability to refuse malicious instructions, raising significant concerns. Recent research into the internal mechanisms of LLMs has identified the refusal direction (r-direction) in the hidden states, which plays a pivotal role in governing refusal behavior. Building on this insight, our study reveals that the r-direction tends to drift during training, which we identify as one of the causes of the associated safety risks. To mitigate such drift, our proposed ProCon method introduces a projection-constrained loss term that regularizes the projection magnitude of each training sample's hidden state onto the r-direction. Our initial analysis shows that applying an appropriate constraint can effectively mitigate the refusal direction drift and associated safety risks, but remains limited by overall performance barriers. To overcome this barrier, informed by our observation of early-stage sharp drift and a data-driven perspective, we introduce a warm-up strategy that emphasizes early-stage strong constraints and broaden the data distribution to strengthen constraint signals, leading to an enhanced ProCon method. Experimental results under various datasets, scenarios, and LLMs demonstrate that our method can significantly mitigate safety risks posed by IFT while preserving task performance gains. Even compared with strong baselines, our method consistently delivers superior overall performance. Crucially, our analysis indicates that ProCon can contribute to stabilizing the r-direction during training, while such an interpretability-driven exploration of LLMs' internal mechanisms lays a solid foundation for future safety research.
中文: ProCon方法通过投影约束损失项和预热策略有效抑制指令微调中大语言模型的拒绝方向漂移,在保障任务性能的同时显著降低安全风险。
English: The ProCon method effectively mitigates safety risks in instruction fine-tuned large language models by constraining refusal direction drift through projection regularization and a warm-up strategy, maintaining task performance.

Authors:Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Zhicheng Dou
Title: UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression
Abstract:
Large language models are increasingly capable of handling long-context inputs, but the memory overhead of key-value (KV) cache remains a major bottleneck for general-purpose deployment. While various compression strategies have been explored, sequence-level compression, which drops the full KV caches for certain tokens, is particularly challenging as it can lead to the loss of important contextual information. To address this, we introduce UniGist, a sequence-level long-context compression framework that efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. We adopt a chunk-free training strategy and design an efficient kernel with a gist shift trick, enabling optimized GPU training. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings. Experiments across multiple long-context tasks demonstrate that UniGist significantly improves compression quality, with especially strong performance in detail-recalling tasks and long-range dependency modeling.
中文: UniGist是一种创新的长上下文压缩框架,通过细粒度地使用压缩标记替代原始标记来高效保留上下文信息,同时支持优化的GPU训练和推理时的实时内存节省,在细节回忆和长程依赖任务中显著提升了性能表现。
English: UniGist is a novel long-context compression framework that replaces raw tokens with fine-grained gist tokens to efficiently preserve context information while enabling optimized GPU training and real-time memory savings during inference, significantly improving performance in detail-recalling and long-range dependency tasks.

Authors:Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
Title: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Abstract:
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods, confidence minimization, self-consistency, or majority-vote objectives, stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.
中文摘要:EVOL-RL是一种新颖的自改进框架,通过结合多数投票的稳定性和新颖性感知的探索,有效防止语言模型的熵崩溃,显著提升了领域内性能和跨领域泛化能力。
English Summary: EVOL-RL is a novel self-improvement framework that prevents entropy collapse in language models by combining majority-voted stability with novelty-aware exploration, significantly enhancing both in-domain performance and out-of-domain generalization.

Authors:Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu
Title: EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving
Abstract:
Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
中文摘要:大型语言模型虽提升了自动定理证明性能,但存在计算效率问题;我们提出的EconProver方法通过动态思维链切换和并行强化学习,仅需12%计算成本即可保持同等性能。
English Summary: Large Language Models have improved Automated Theorem Proving but face computational inefficiency, which our proposed EconProver method addresses by reducing token usage and sampling passes while maintaining performance with only 12% of the computational cost.

Authors:Runpeng Dai, Linfeng Song, Haolin Liu, Zhenwen Liang, Dian Yu, Haitao Mi, Zhaopeng Tu, Rui Liu, Tong Zheng, Hongtu Zhu, Dong Yu
Title: CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
中文摘要:本文提出好奇心驱动探索(CDE)框架,通过结合行动者困惑度和评论家价值估计方差的内在好奇心信号,增强可验证奖励强化学习的探索能力,在AIME基准上实现约3分提升,同时揭示了LLM中校准崩溃这一关键失效模式。
English Summary: The paper introduces Curiosity-Driven Exploration (CDE), a framework that uses intrinsic curiosity signals from both actor perplexity and critic variance to enhance exploration in Reinforcement Learning with Verifiable Rewards, achieving a 3-point improvement on AIME benchmarks while revealing calibration collapse as a key failure mode in LLMs.

Authors:Deepak Alapatt, Jennifer Eckhoff, Zhiliang Lyu, Yutong Ban, Jean-Paul Mazellier, Sarah Choksi, Kunyi Yang, 2024 CVS Challenge Consortium, Quanzheng Li, Filippo Filicori, Xiang Li, Pietro Mascagni, Daniel A. Hashimoto, Guy Rosman, Ozanan Meireles, Nicolas Padoy
Title: The SAGES Critical View of Safety Challenge: A Global Benchmark for AI-Assisted Surgical Quality Assessment
Abstract:
Advances in artificial intelligence (AI) for surgical quality assessment promise to democratize access to expertise, with applications in training, guidance, and accreditation. This study presents the SAGES Critical View of Safety (CVS) Challenge, the first AI competition organized by a surgical society, using the CVS in laparoscopic cholecystectomy, a universally recommended yet inconsistently performed safety step, as an exemplar of surgical quality assessment. A global collaboration across 54 institutions in 24 countries engaged hundreds of clinicians and engineers to curate 1,000 videos annotated by 20 surgical experts according to a consensus-validated protocol. The challenge addressed key barriers to real-world deployment in surgery, including achieving high performance, capturing uncertainty in subjective assessment, and ensuring robustness to clinical variability. To enable this scale of effort, we developed EndoGlacier, a framework for managing large, heterogeneous surgical video and multi-annotator workflows. Thirteen international teams participated, achieving up to a 17\% relative gain in assessment performance, over 80\% reduction in calibration error, and a 17\% relative improvement in robustness over the state-of-the-art. Analysis of results highlighted methodological trends linked to model performance, providing guidance for future research toward robust, clinically deployable AI for surgical quality assessment.
中文:SAGES CVS挑战赛作为首个由外科学会组织的AI竞赛,通过全球合作克服了关键应用障碍,在评估腹腔镜胆囊切除术安全步骤的AI模型性能上取得显著提升,推动了手术质量评估的发展。
English: The SAGES CVS Challenge, the first surgical society-led AI competition, leveraged global collaboration to advance surgical quality assessment by overcoming key deployment barriers and achieving significant performance improvements in AI models for evaluating laparoscopic cholecystectomy safety steps.

Authors:Hantao Yang, Hong Xie, Defu Lian, Enhong Chen
Title: LLM Cache Bandit Revisited: Addressing Query Heterogeneity for Cost-Effective LLM Inference
Abstract:
This paper revisits the LLM cache bandit problem, with a special focus on addressing the query heterogeneity for cost-effective LLM inference. Previous works often assume uniform query sizes. Heterogeneous query sizes introduce a combinatorial structure for cache selection, making the cache replacement process more computationally and statistically challenging. We treat optimal cache selection as a knapsack problem and employ an accumulation-based strategy to effectively balance computational overhead and cache updates. In theoretical analysis, we prove that the regret of our algorithm achieves an $O(\sqrt{MNT})$ bound, improving the coefficient of $\sqrt{MN}$ compared to the $O(MN\sqrt{T})$ result in Berkeley, where $N$ is the total number of queries and $M$ is the cache size. Additionally, we also provide a problem-dependent bound, which was absent in previous works. The experiment rely on real-world data show that our algorithm reduces the total cost by approximately 12\%.
Chinese: 本文通过将缓存选择视为背包问题来处理LLM缓存强盗问题,提出了一种基于累积的策略,实现了更优的\(O(\sqrt{MNT})\)遗憾界,并在真实数据实验中降低了约12%的总成本。
English: This paper addresses the LLM cache bandit problem by treating cache selection as a knapsack problem, introducing an accumulation-based strategy that achieves a superior regret bound of \(O(\sqrt{MNT})\) and reduces total costs by about 12% in real-world experiments.

Authors:Daniel DeAlcala, Aythami Morales, Julian Fierrez, Gonzalo Mancera, Ruben Tolosana, Javier Ortega-Garcia
Title: Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning
Abstract:
Active Membership Inference Test (aMINT) is a method designed to detect whether given data were used during the training of machine learning models. In Active MINT, we propose a novel multitask learning process that involves training simultaneously two models: the original or Audited Model, and a secondary model, referred to as the MINT Model, responsible for identifying the data used for training the Audited Model. This novel multi-task learning approach has been designed to incorporate the auditability of the model as an optimization objective during the training process of neural networks. The proposed approach incorporates intermediate activation maps as inputs to the MINT layers, which are trained to enhance the detection of training data. We present results using a wide range of neural networks, from lighter architectures such as MobileNet to more complex ones such as Vision Transformers, evaluated in 5 public benchmarks. Our proposed Active MINT achieves over 80% accuracy in detecting if given data was used for training, significantly outperforming previous approaches in the literature. Our aMINT and related methodological developments contribute to increasing transparency in AI models, facilitating stronger safeguards in AI deployments to achieve proper security, privacy, and copyright protection.
中文: Active MINT是一种新颖的多任务学习方法,通过训练辅助模型来检测机器学习模型的训练数据使用情况,在多种架构上实现超过80%的准确率,有效提升人工智能的透明度。
English: Active MINT is a novel multitask learning method that trains a secondary model to detect training data usage in machine learning models, achieving over 80% accuracy across various architectures and enhancing AI transparency.

Authors:Shian Du, Menghan Xia, Chang Liu, Xintao Wang, Jing Wang, Pengfei Wan, Di Zhang, Xiangyang Ji
Title: PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution
Abstract:
Pre-trained video generation models hold great potential for generative video super-resolution (VSR). However, adapting them for full-size VSR, as most existing methods do, suffers from unnecessary intensive full-attention computation and fixed output resolution. To overcome these limitations, we make the first exploration into utilizing video diffusion priors for patch-wise VSR. This is non-trivial because pre-trained video diffusion models are not native for patch-level detail generation. To mitigate this challenge, we propose an innovative approach, called PatchVSR, which integrates a dual-stream adapter for conditional guidance. The patch branch extracts features from input patches to maintain content fidelity while the global branch extracts context features from the resized full video to bridge the generation gap caused by incomplete semantics of patches. Particularly, we also inject the patch's location information into the model to better contextualize patch synthesis within the global video frame. Experiments demonstrate that our method can synthesize high-fidelity, high-resolution details at the patch level. A tailor-made multi-patch joint modulation is proposed to ensure visual consistency across individually enhanced patches. Due to the flexibility of our patch-based paradigm, we can achieve highly competitive 4K VSR based on a 512x512 resolution base model, with extremely high efficiency.
中文摘要:本文提出PatchVSR方法,通过双流适配器利用预训练视频扩散模型实现分块视频超分辨率,在保持内容保真度的同时高效生成高分辨率细节,并能基于低分辨率基础模型实现4K超分辨率。
English Summary: This paper introduces PatchVSR, a novel patch-based video super-resolution method that leverages pre-trained video diffusion models through a dual-stream adapter to efficiently generate high-resolution details while maintaining content fidelity and visual consistency across patches.

Authors:Tianrui Qin, Qianben Chen, Sinuo Wang, He Xing, King Zhu, He Zhu, Dingfeng Shi, Xinxin Liu, Ge Zhang, Jiaheng Liu, Yuchen Eleanor Jiang, Xitong Gao, Wangchunshu Zhou
Title: Flash-Searcher: Fast and Effective Web Agents via DAG-Based Parallel Execution
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks when equipped with external tools. However, current frameworks predominantly rely on sequential processing, leading to inefficient execution particularly for tasks requiring extensive tool interaction. This paper introduces Flash-Searcher, a novel parallel agent reasoning framework that fundamentally reimagines the execution paradigm from sequential chains to directed acyclic graphs (DAGs). Flash-Searcher decomposes complex tasks into subtasks with explicit dependencies, enabling concurrent execution of independent reasoning paths while maintaining logical constraints. Through dynamic workflow optimization, our framework continuously refines the execution graph based on intermediate results, effectively integrating summary module. Comprehensive evaluations across multiple benchmarks demonstrate that Flash-Searcher consistently outperforms existing approaches. Specifically, it achieves 67.7% accuracy on BrowseComp and 83% on xbench-DeepSearch, while reducing agent execution steps by up to 35% compared to current frameworks. Furthermore, when distilling this parallel reasoning pipeline into single models, we observe substantial performance gains across diverse backbone architectures, underscoring the generalizability of our methodology. Our work thus represents a significant advance in agent architecture design, offering a more scalable and efficient paradigm for complex reasoning tasks.
中文:Flash-Searcher提出了一种基于有向无环图的并行推理框架,通过并发执行子任务在减少35%操作步骤的同时显著提升准确率,其方法论在不同模型架构中均展现出卓越的通用性。
English: Flash-Searcher introduces a parallel reasoning framework using directed acyclic graphs to enable concurrent task execution, achieving higher accuracy and up to 35% fewer steps than sequential methods while demonstrating broad applicability across models.

Authors:Yufei Wei, Wangtao Lu, Sha Lu, Chenxiao Hu, Fuzhang Han, Rong Xiong, Yue Wang
Title: BEV-ODOM2: Enhanced BEV-based Monocular Visual Odometry with PV-BEV Fusion and Dense Flow Supervision for Ground Robots
Abstract:
Bird's-Eye-View (BEV) representation offers a metric-scaled planar workspace, facilitating the simplification of 6-DoF ego-motion to a more robust 3-DoF model for monocular visual odometry (MVO) in intelligent transportation systems. However, existing BEV methods suffer from sparse supervision signals and information loss during perspective-to-BEV projection. We present BEV-ODOM2, an enhanced framework addressing both limitations without additional annotations. Our approach introduces: (1) dense BEV optical flow supervision constructed from 3-DoF pose ground truth for pixel-level guidance; (2) PV-BEV fusion that computes correlation volumes before projection to preserve 6-DoF motion cues while maintaining scale consistency. The framework employs three supervision levels derived solely from pose data: dense BEV flow, 5-DoF for the PV branch, and final 3-DoF output. Enhanced rotation sampling further balances diverse motion patterns in training. Extensive evaluation on KITTI, NCLT, Oxford, and our newly collected ZJH-VO multi-scale dataset demonstrates state-of-the-art performance, achieving 40 improvement in RTE compared to previous BEV methods. The ZJH-VO dataset, covering diverse ground vehicle scenarios from underground parking to outdoor plazas, is publicly available to facilitate future research.
中文:BEV-ODOM2通过引入密集BEV光流监督和PV-BEV融合技术,有效解决了稀疏监督和透视投影信息丢失问题,在多个数据集上实现领先性能,相对轨迹误差降低40%。
English: BEV-ODOM2 enhances monocular visual odometry by introducing dense BEV optical flow supervision and PV-BEV fusion to address sparse supervision and information loss, achieving state-of-the-art performance with a 40% RTE improvement across multiple datasets.

Authors:Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
Title: Generalist Scanner Meets Specialist Locator: A Synergistic Coarse-to-Fine Framework for Robust GUI Grounding
Abstract:
Grounding natural language queries in graphical user interfaces (GUIs) presents a challenging task that requires models to comprehend diverse UI elements across various applications and systems, while also accurately predicting the spatial coordinates for the intended operation. To tackle this problem, we propose GMS: Generalist Scanner Meets Specialist Locator, a synergistic coarse-to-fine framework that effectively improves GUI grounding performance. GMS leverages the complementary strengths of general vision-language models (VLMs) and small, task-specific GUI grounding models by assigning them distinct roles within the framework. Specifically, the general VLM acts as a 'Scanner' to identify potential regions of interest, while the fine-tuned grounding model serves as a 'Locator' that outputs precise coordinates within these regions. This design is inspired by how humans perform GUI grounding, where the eyes scan the interface and the brain focuses on interpretation and localization. Our whole framework consists of five stages and incorporates hierarchical search with cross-modal communication to achieve promising prediction results. Experimental results on the ScreenSpot-Pro dataset show that while the 'Scanner' and 'Locator' models achieve only $2.0\%$ and $3.7\%$ accuracy respectively when used independently, their integration within GMS framework yields an overall accuracy of $35.7\%$, representing a $10 \times$ improvement. Additionally, GMS significantly outperforms other strong baselines under various settings, demonstrating its robustness and potential for general-purpose GUI grounding.
中文摘要:GMS框架通过让通用视觉语言模型担任“扫描器”识别兴趣区域,专业定位模型担任“定位器”输出精确坐标的协同机制,在图形界面自然语言查询任务中实现了35.7%的整体准确率,比单独模型提升10倍。
English Summary: The GMS framework synergistically combines a general vision-language model as a 'Scanner' to identify regions of interest with a specialized grounding model as a 'Locator' for precise coordinate prediction, achieving a 10× accuracy improvement (35.7%) on GUI grounding tasks compared to individual models.

Authors:Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai
Title: FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning
Abstract:
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require either broad temporal coverage or fine-grained spatial detail. In this paper, we introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract targeted frames or video clips based on identified knowledge gaps. To train effective dynamic sampling policies, we propose Dynamic Resolution Frame Sampling (DRFS), which exposes models to diverse temporal-spatial trade-offs during learning, and DRFS-GRPO, a group-relative policy optimization algorithm that learns from outcome-based rewards without requiring frame-level annotations. Extensive experiments on challenging benchmarks like MLVU and VideoMME demonstrate that our method significantly outperforms existing models, advancing the state of the art in flexible and efficient video understanding.
中文摘要:FrameMind提出了一种通过强化学习训练的动态视频理解框架,能够在推理过程中自适应地请求视觉信息,在多个基准测试上显著优于传统的固定帧采样方法。
English Summary: FrameMind introduces a dynamic video understanding framework using reinforcement learning to adaptively request visual information during reasoning, significantly outperforming traditional fixed-frame methods on major benchmarks.

Authors:Chunxue Xu, Yiwei Wang, Yujun Cai, Bryan Hooi, Songze Li
Title: Visual CoT Makes VLMs Smarter but More Fragile
Abstract:
Chain-of-Thought (CoT) techniques have significantly enhanced reasoning in Vision-Language Models (VLMs). Extending this paradigm, Visual CoT integrates explicit visual edits, such as cropping or annotating regions of interest, into the reasoning process, achieving superior multimodal performance. However, the robustness of Visual CoT-based VLMs against image-level noise remains unexplored. In this paper, we present the first systematic evaluation of Visual CoT robustness under visual perturbations. Our benchmark spans 12 image corruption types across 4 Visual Question Answering (VQA) datasets, enabling a comprehensive comparison between VLMs that use Visual CoT, and VLMs that do not. The results reveal that integrating Visual CoT consistently improves absolute accuracy regardless of whether the input images are clean or corrupted by noise; however, it also increases sensitivity to input perturbations, resulting in sharper performance degradation compared to standard VLMs. Through extensive analysis, we identify the intermediate reasoning components of Visual CoT, i.e., the edited image patches , as the primary source of fragility. Building on this analysis, we propose a plug-and-play robustness enhancement method that integrates Grounding DINO model into the Visual CoT pipeline, providing high-confidence local visual cues to stabilize reasoning. Our work reveals clear fragility patterns in Visual CoT and offers an effective, architecture-agnostic solution for enhancing visual robustness.
中文: 视觉思维链通过整合视觉编辑增强了视觉语言模型的推理能力,但增加了对图像扰动的敏感性,并提出了利用Grounding DINO模型来稳定性能的解决方案。
English: Visual CoT enhances reasoning in VLMs by integrating visual edits but increases sensitivity to image perturbations, with a proposed solution using Grounding DINO to stabilize performance.

Authors:Jinyi Han, Ying Huang, Ying Liao, Zishang Jiang, Xikun Lu, Haiquan Zhao, Xinyi Wang, Guanghao Zhou, Sihang Jiang, Jiaqing Liang, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao
Title: Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking
Abstract:
Large Reasoning Models (LRMs) have achieved impressive performance on challenging tasks, yet their deep reasoning often incurs substantial computational costs. To achieve efficient reasoning, existing reinforcement learning methods still struggle to construct short reasoning path during the rollout stage, limiting effective learning. Inspired by Evidence Accumulation Models, we find that LRMs have accumulated sufficient information early in reasoning, making further reasoning steps redundant. Based on this insight, we propose Just-Enough Thinking (JET), which trains models to proactively terminate unnecessary reasoning. JET performs trajectory truncation during rollout to expose the model to short, distributionally consistent reasoning paths. Besides, it uses a quality-controlled length reward to better encourage concise reasoning while maintaining correctness. Extensive experiments demonstrate that JET significantly improves reasoning efficiency without sacrificing accuracy. Especially, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark. Our code is available in the GitHub.
中文:提出的“恰到好处思考”(JET)方法训练大型推理模型主动终止不必要的推理步骤,通过轨迹截断和质量控制的长度奖励,在保持准确性的同时显著提升了推理效率。
English: The proposed Just-Enough Thinking (JET) method trains Large Reasoning Models to proactively terminate unnecessary reasoning steps, significantly improving efficiency without sacrificing accuracy by truncating trajectories and using quality-controlled length rewards.

Authors:Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
Title: Decoupling Reasoning and Perception: An LLM-LMM Framework for Faithful Visual Reasoning
Abstract:
Significant advancements in the reasoning capabilities of Large Language Models (LLMs) are now driven by test-time scaling laws, particularly those leveraging extended Chain-of-Thought (CoT) reasoning. Inspired by these breakthroughs, researchers have extended these paradigms to Large Multimodal Models (LMMs). However, a critical limitation emerges: as their reasoning chains extend, LMMs increasingly rely on textual logic, progressively losing grounding in the underlying visual information. This leads to reasoning paths that diverge from the image content, culminating in erroneous conclusions. To address this, we introduce a strikingly simple yet effective training-free visual-reasoning pipeline. The core concept is to decouple the reasoning and perception processes. A powerful LLM orchestrates the high-level reasoning, strategically interrogating a LMM to extract specific visual information required for its logical chain. The LMM, in turn, functions exclusively as a visual question-answering engine, supplying the necessary perceptual details on demand. This lightweight, plug-and-play approach requires no additional training or architectural changes. Comprehensive evaluations validate that our framework effectively governs the visual reasoning process, leading to a significant reduction in visually-unfounded reasoning steps and a substantial improvement in reasoning fidelity.
Chinese: 本文提出了一种无需训练的简单有效方法,将推理与感知分离,利用大型语言模型主导推理过程,并让大型多模态模型充当视觉问答引擎,显著减少了无根据的推理步骤并大幅提升了推理准确性。
English: This paper introduces a simple yet effective training-free pipeline that decouples reasoning and perception, using a large language model to guide the reasoning process and a large multimodal model as a visual question-answering engine, significantly reducing unfounded reasoning and improving fidelity.

Authors:Xinhao Zhong, Yimin Zhou, Zhiqi Zhang, Junhao Li, Yi Sun, Bin Chen, Shu-Tao Xia, Ke Xu
Title: Closing the Safety Gap: Surgical Concept Erasure in Visual Autoregressive Models
Abstract:
The rapid progress of visual autoregressive (VAR) models has brought new opportunities for text-to-image generation, but also heightened safety concerns. Existing concept erasure techniques, primarily designed for diffusion models, fail to generalize to VARs due to their next-scale token prediction paradigm. In this paper, we first propose a novel VAR Erasure framework VARE that enables stable concept erasure in VAR models by leveraging auxiliary visual tokens to reduce fine-tuning intensity. Building upon this, we introduce S-VARE, a novel and effective concept erasure method designed for VAR, which incorporates a filtered cross entropy loss to precisely identify and minimally adjust unsafe visual tokens, along with a preservation loss to maintain semantic fidelity, addressing the issues such as language drift and reduced diversity introduce by naïve fine-tuning. Extensive experiments demonstrate that our approach achieves surgical concept erasure while preserving generation quality, thereby closing the safety gap in autoregressive text-to-image generation by earlier methods.
中文: 本文提出了VARE和S-VARE两种新型框架,通过精准调整不安全视觉标记并保持语义保真度,实现了视觉自回归模型中稳定的概念擦除,有效解决了文本生成图像的安全隐患。
English: This paper introduces VARE and S-VARE, novel frameworks for stable concept erasure in visual autoregressive models that address safety concerns by minimizing adjustments to unsafe tokens while preserving image quality and semantic fidelity.

Authors:Dwip Dalal, Gautam Vashishtha, Anku Ranui, Aishwarya Reganti, Parth Patwa, Mohd Sarique, Chandan Gupta, Keshav Nath, Viswanatha Reddy, Vinija Jain, Aman Chadha, Amitava Das, Amit Sheth, Asif Ekbal
Title: DeHate: A Stable Diffusion-based Multimodal Approach to Mitigate Hate Speech in Images
Abstract:
The rise in harmful online content not only distorts public discourse but also poses significant challenges to maintaining a healthy digital environment. In response to this, we introduce a multimodal dataset uniquely crafted for identifying hate in digital content. Central to our methodology is the innovative application of watermarked, stability-enhanced, stable diffusion techniques combined with the Digital Attention Analysis Module (DAAM). This combination is instrumental in pinpointing the hateful elements within images, thereby generating detailed hate attention maps, which are used to blur these regions from the image, thereby removing the hateful sections of the image. We release this data set as a part of the dehate shared task. This paper also describes the details of the shared task. Furthermore, we present DeHater, a vision-language model designed for multimodal dehatification tasks. Our approach sets a new standard in AI-driven image hate detection given textual prompts, contributing to the development of more ethical AI applications in social media.
中文: 本文提出了一种多模态数据集及DeHater模型,结合水印稳定扩散技术和DAAM模块来识别并模糊图像中的仇恨内容,为社交媒体伦理AI发展树立了新标准。
English: This paper introduces a multimodal dataset and DeHater model using watermarked stable diffusion and DAAM to detect and blur hateful content in images, advancing ethical AI for social media.

Authors:Zhen Xiong, Yujun Cai, Zhecheng Li, Junsong Yuan, Yiwei Wang
Title: Thinking with Sound: Audio Chain-of-Thought Enables Multimodal Reasoning in Large Audio-Language Models
Abstract:
Recent Large Audio-Language Models (LALMs) have shown strong performance on various audio understanding tasks such as speech translation and Audio Q\&A. However, they exhibit significant limitations on challenging audio reasoning tasks in complex acoustic scenarios. These situations would greatly benefit from the use of acoustic tools like noise suppression, source separation, and precise temporal alignment, but current LALMs lack access to such tools. To address this limitation, we introduce Thinking-with-Sound (TwS), a framework that equips LALMs with Audio CoT by combining linguistic reasoning with on-the-fly audio-domain analysis. Unlike existing approaches that treat audio as static input, TwS enables models to actively think with audio signals, performing numerical analysis and digital manipulation through multimodal reasoning. To evaluate this approach, we construct MELD-Hard1k, a new robustness benchmark created by introducing various acoustic perturbations. Experiments reveal that state-of-the-art LALMs suffer dramatic performance degradation on MELD-Hard1k, with accuracy dropping by more than $50\%$ compared to clean audio. TwS achieves substantial improvements in robustness, demonstrating both effectiveness and scalability: small models gain $24.73\%$ absolute accuracy, with improvements scaling consistently up to $36.61\%$ for larger models. Our findings demonstrate that Audio CoT can significantly enhance robustness without retraining, opening new directions for developing more robust audio understanding systems.
中文摘要:Thinking-with-Sound框架通过整合声学分析工具解决了大型音频语言模型在复杂场景中的推理局限,在不重新训练模型的情况下,在挑战性音频基准测试中实现了高达36.61%的鲁棒性提升。
English Summary: The Thinking-with-Sound framework addresses audio reasoning limitations in Large Audio-Language Models by integrating acoustic analysis tools, achieving substantial robustness improvements of up to 36.61% on challenging audio benchmarks without requiring model retraining.

Authors:Zhihong Sun, Jia Li, Yao Wan, Chuanyi Li, Hongyu Zhang, Zhi jin, Ge Li, Hong Liu, Chen Lyu, Songlin Hu
Title: Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation
Abstract:
Code vulnerability detection is crucial for ensuring the security and reliability of modern software systems. Recently, Large Language Models (LLMs) have shown promising capabilities in this domain. However, notable discrepancies in detection results often arise when analyzing identical code segments across different training stages of the same model or among architecturally distinct LLMs. While such inconsistencies may compromise detection stability, they also highlight a key opportunity: the latent complementarity among models can be harnessed through ensemble learning to create more robust vulnerability detection systems. In this study, we explore the potential of ensemble learning to enhance the performance of LLMs in source code vulnerability detection. We conduct comprehensive experiments involving five LLMs (i.e., DeepSeek-Coder-6.7B, CodeLlama-7B, CodeLlama-13B, CodeQwen1.5-7B, and StarCoder2-15B), using three ensemble strategies (i.e., Bagging, Boosting, and Stacking). These experiments are carried out across three widely adopted datasets (i.e., Devign, ReVeal, and BigVul). Inspired by Mixture of Experts (MoE) techniques, we further propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection. Our results demonstrate that ensemble approaches can significantly improve detection performance, with Boosting excelling in scenarios involving imbalanced datasets. Moreover, DGS consistently outperforms traditional Stacking, particularly in handling class imbalance and multi-class classification tasks. These findings offer valuable insights into building more reliable and effective LLM-based vulnerability detection systems through ensemble learning.
中文: 集成学习,特别是提出的动态门控堆叠方法,通过利用不同模型在多样化数据集和场景中的互补性,显著提升了大型语言模型在代码漏洞检测中的性能。
English: Ensemble learning, particularly the proposed Dynamic Gated Stacking method, significantly enhances large language models' performance in code vulnerability detection by leveraging model complementarity across diverse datasets and scenarios.

Authors:Minghang Zhu, Zhengliang Shi, Zhiwei Xu, Shiguang Wu, Lingjie Wang, Pengjie Ren, Zhaochun Ren, Zhumin Chen
Title: Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems
Abstract:
The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agents collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capablity. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.
中文: MOAT框架通过迭代对齐规划与执行代理,有效提升了多智能体系统的协作能力,在多项基准测试中显著优于现有最优方法。
English: The MOAT framework enhances multi-agent collaboration by jointly aligning planning and grounding agents through iterative tuning, achieving superior performance on complex tasks compared to existing methods.

Authors:Snehasis Mukhopadhyay, Aryan Kasat, Shivam Dubey, Rahul Karthikeyan, Dhruv Sood, Vinija Jain, Aman Chadha, Amitava Das
Title: AMBEDKAR-A Multi-level Bias Elimination through a Decoding Approach with Knowledge Augmentation for Robust Constitutional Alignment of Language Models
Abstract:
Large Language Models (LLMs) can inadvertently reflect societal biases present in their training data, leading to harmful or prejudiced outputs. In the Indian context, our empirical evaluations across a suite of models reveal that biases around caste and religion are particularly salient. Yet, most existing mitigation strategies are Western-centric and fail to address these local nuances. We propose AMBEDKAR, a framework inspired by the egalitarian vision of Dr B. R. Ambedkar, architect of the Indian Constitution, to guide LLM outputs toward fairness, neutrality, and inclusion in line with Articles 14 to 17. Our approach introduces a Constitution-Aware Decoding Layer, guided by the AI Constitution of India and applied only at inference time, without any parameter updates to the base model. We incorporate a speculative decoding algorithm that proactively reduces casteist and communal bias during generation. This mitigation layer operates directly within the decoding process, avoiding changes to model internals and lowering the computational and infrastructural costs associated with retraining. We reinterpret speculative decoding not merely as an efficiency tool but as a mechanism for fairness. In this framework, a Small Language Model (SLM) acts as a potentially biased generator, while a constitutionally guided Large Language Model (LLM) serves as the verifier. Rather than accelerating generation, the LLM enforces bias-robust trajectories in the SLM outputs. This inversion of roles gives rise to a fairness-by-speculation paradigm. Our approach yields an absolute reduction of bias up to 26.41 percent compared to baseline. Our source code, datasets, and results are available at https://anonymous.4open.science/r/AMBEDKAR-983B/
中文: AMBEDKAR框架通过宪法感知解码层和推测性解码技术,在不重新训练模型的情况下有效减少大型语言模型中的种姓和宗教偏见,在印度语境中实现了高达26.41%的绝对偏见减少。
English: The AMBEDKAR framework introduces a constitution-aware decoding layer and speculative decoding to mitigate caste and religious biases in Large Language Models for the Indian context, achieving up to 26.41% absolute bias reduction without model retraining.

Authors:Zhengran Zeng, Ruikai Shi, Keke Han, Yixin Li, Kaicheng Sun, Yidong Wang, Zhuohao Yu, Rui Xie, Wei Ye, Shikun Zhang
Title: Benchmarking and Studying the LLM-based Code Review
Abstract:
Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWRBench , a new benchmark comprising 1000 manually verified Pull Requests (PRs) from GitHub, offering PR-centric review with full project context. SWRBench employs an objective LLM-based evaluation method that aligns strongly with human judgment (~90 agreement) by verifying if issues from a structured ground truth are covered in generated reviews. Our systematic evaluation of mainstream ACR tools and LLMs on SWRBench reveals that current systems underperform, and ACR tools are more adept at detecting functional errors. Subsequently, we propose and validate a simple multi-review aggregation strategy that significantly boosts ACR performance, increasing F1 scores by up to 43.67%. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach, offering valuable insights for advancing ACR research.
中文: SWRBench作为新型基准测试,通过采用1000个真实GitHub拉取请求和基于大语言模型的评估方法,解决了自动化代码评审评估中的局限性,其评估结果与人工判断高度一致,不仅揭示了现有系统的不足,还提出了能显著提升性能的多评审聚合策略。
English: SWRBench is a new benchmark addressing limitations in automated code review evaluation by using 1000 real GitHub pull requests with full project context and an LLM-based evaluation method that aligns closely with human judgment, revealing current systems' underperformance while proposing an aggregation strategy that boosts performance significantly.

Authors:Haoyu Zheng, Zhuonan Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Zheqi Lv, Juncheng Li, Siliang Tang, Yueting Zhuang, Hongyang He
Title: Fast Thinking for Large Language Models
Abstract:
Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT) techniques substantially enhance performance on complex reasoning tasks, they remain inefficient, requiring long reasoning traces that increase latency and token usage. In this work, we introduce Latent Codebooks for Fast Thinking, a framework that uses concise CoT sketches only during training to learn a codebook of discrete strategy priors. At inference, the model conditions on a handful of continuous thinking vectors distilled from the codebook in a single pass, enabling strategy-level guidance without producing explicit reasoning tokens. To complement this design, we propose GainRouter, a lightweight routing mechanism that adaptively switches between fast codebook guided inference and slow explicit reasoning, thereby suppressing overthinking and reducing unnecessary token generation. Experiments across multiple reasoning benchmarks show that our approach achieves competitive or superior accuracy while substantially lowering inference cost, offering a practical path toward efficient and controllable reasoning in large language models.
中文: 本文提出了一种利用潜在码本和增益路由器的框架,通过减少显式推理标记的生成,在保持高精度的同时显著降低了大语言模型的推理成本。
English: This paper introduces a framework using latent codebooks and GainRouter for efficient reasoning in LLMs, achieving high accuracy with lower inference costs by reducing explicit token generation.

Authors:Bozhen Hu, Cheng Tan, Siyuan Li, Jiangbin Zheng, Sizhe Qiu, Jun Xia, Stan Z. Li
Title: Multimodal Regression for Enzyme Turnover Rates Prediction
Abstract:
The enzyme turnover rate is a fundamental parameter in enzyme kinetics, reflecting the catalytic efficiency of enzymes. However, enzyme turnover rates remain scarce across most organisms due to the high cost and complexity of experimental measurements. To address this gap, we propose a multimodal framework for predicting the enzyme turnover rate by integrating enzyme sequences, substrate structures, and environmental factors. Our model combines a pre-trained language model and a convolutional neural network to extract features from protein sequences, while a graph neural network captures informative representations from substrate molecules. An attention mechanism is incorporated to enhance interactions between enzyme and substrate representations. Furthermore, we leverage symbolic regression via Kolmogorov-Arnold Networks to explicitly learn mathematical formulas that govern the enzyme turnover rate, enabling interpretable and accurate predictions. Extensive experiments demonstrate that our framework outperforms both traditional and state-of-the-art deep learning approaches. This work provides a robust tool for studying enzyme kinetics and holds promise for applications in enzyme engineering, biotechnology, and industrial biocatalysis.
中文: 本研究提出了一种多模态框架,通过整合酶序列、底物结构和环境因素,利用先进的神经网络和符号回归技术,实现了对酶转化率的高精度且可解释的预测,其性能优于现有方法,在生物技术领域具有广泛应用前景。
English: This study introduces a multimodal framework that integrates enzyme sequences, substrate structures, and environmental factors using advanced neural networks and symbolic regression to accurately and interpretably predict enzyme turnover rates, outperforming existing methods and offering broad applications in biotechnology.

Authors:Zhaoyu Fan, Kaihang Pan, Mingze Zhou, Bosheng Qin, Juncheng Li, Shengyu Zhang, Wenqiao Zhang, Siliang Tang, Fei Wu, Yueting Zhuang
Title: Towards Meta-Cognitive Knowledge Editing for Multimodal LLMs
Abstract:
Knowledge editing enables multimodal large language models (MLLMs) to efficiently update outdated or incorrect information. However, existing benchmarks primarily emphasize cognitive-level modifications while lacking a focus on deeper meta-cognitive processes. To bridge this gap, we introduce CogEdit, a novel benchmark designed to evaluate MLLMs' meta-cognitive knowledge editing abilities across three levels: (1) Counterfactual-Driven Editing, assessing self-awareness of knowledge correctness changes; (2) Boundary Constraint Editing, ensuring appropriate generalization without unintended interference; and (3) Noise-Robust Editing, promoting reflective evaluation of uncertain information. To advance meta-cognitive editing, we propose MIND (Meta-cognitive INtegrated Dynamic Knowledge Editing), a framework that constructs a meta-knowledge memory for self-awareness, employs game-theoretic interactions to monitor knowledge activation, and incorporates label refinement for noise-robust updates. Extensive experiments show that MIND significantly outperforms existing cognitive editing approaches, achieving strong performance on both traditional and meta-cognitive knowledge editing benchmarks.
中文:该摘要介绍了CogEdit——一个从三个层面评估多模态大语言模型元认知知识编辑能力的基准,并提出了MIND框架,实验证明该框架在传统与元认知知识编辑任务中均显著优于现有认知编辑方法。
English: This abstract introduces CogEdit, a benchmark for evaluating meta-cognitive knowledge editing in multimodal large language models across three levels, and proposes the MIND framework, which demonstrates superior performance over existing cognitive editing methods through extensive experiments.

Authors:Tianrui Wang, Haoyu Wang, Meng Ge, Cheng Gong, Chunyu Qiang, Ziyang Ma, Zikang Huang, Guanrou Yang, Xiaobao Wang, Eng Siong Chng, Xie Chen, Longbiao Wang, Jianwu Dang
Title: Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
Abstract:
While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.
中文:WeSCon是一个自训练框架,无需专用数据集即可在零样本TTS模型中实现词级情感和语速控制,通过过渡平滑和动态控制机制达到了最先进的性能表现。
English: WeSCon is a self-training framework that enables word-level control of emotion and speaking rate in zero-shot TTS models without requiring specialized datasets, achieving state-of-the-art performance through transition-smoothing and dynamic control mechanisms.

Authors:Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou, Kai Wang, Bohan Zhuang, Zhangyang Wang, Fan Wang, Yang You
Title: RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer
Abstract:
Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers, a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads - Step-Skip, Cache-Reuse, and Sparse-Attention - observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model's distribution. Across state-of-the-art DiT backbones, including Stable Diffusion 3 and FLUX, RAPID3 achieves nearly 3x faster sampling with competitive generation quality.
中文: RAPID3提出无需训练的三级策略框架,通过轻量化策略头实现扩散Transformer的逐图自适应加速,在保持生成质量的同时将采样速度提升近3倍,其策略通过强化学习和对抗判别联合优化。
English: RAPID3 introduces a training-free framework with three lightweight policy heads that enable image-specific acceleration for Diffusion Transformers, achieving nearly 3x faster sampling while maintaining competitive generation quality through reinforcement learning and adversarial discrimination.

Authors:Bingsheng Yao, Menglin Zhao, Zhan Zhang, Pengqi Wang, Emma G Chester, Changchang Yin, Tianshi Li, Varun Mishra, Lace Padilla, Odysseas Chatzipanagiotou, Timothy Pawlik, Ping Zhang, Weidan Cao, Dakuo Wang
Title: Exploring Collaboration Breakdowns Between Provider Teams and Patients in Post-Surgery Care
Abstract:
Post-surgery care involves ongoing collaboration between provider teams and patients, which starts from post-surgery hospitalization through home recovery after discharge. While prior HCI research has primarily examined patients' challenges at home, less is known about how provider teams coordinate discharge preparation and care handoffs, and how breakdowns in communication and care pathways may affect patient recovery. To investigate this gap, we conducted semi-structured interviews with 13 healthcare providers and 4 patients in the context of gastrointestinal (GI) surgery. We found coordination boundaries between in- and out-patient teams, coupled with complex organizational structures within teams, impeded the "invisible work" of preparing patients' home care plans and triaging patient information. For patients, these breakdowns resulted in inadequate preparation for home transition and fragmented self-collected data, both of which undermine timely clinical decision-making. Based on these findings, we outline design opportunities to formalize task ownership and handoffs, contextualize co-temporal signals, and align care plans with home resources.
中文: 本研究发现,住院与门诊团队间的协调鸿沟及复杂的组织结构阻碍了出院准备和患者数据管理,导致家庭过渡准备不足和自我监测数据碎片化,从而影响临床决策的及时性。
English: This study reveals that coordination gaps between inpatient and outpatient teams, along with complex organizational structures, hinder effective discharge preparation and patient data management, leading to inadequate home transition and fragmented self-monitoring that impede timely clinical decisions.

Authors:Lu Sun, Shihan Fu, Bingsheng Yao, Yuxuan Lu, Wenbo Li, Hansu Gu, Jiri Gesi, Jing Huang, Chen Luo, Dakuo Wang
Title: LLM Agent Meets Agentic AI: Can LLM Agents Simulate Customers to Evaluate Agentic-AI-based Shopping Assistants?
Abstract:
Agentic AI is emerging, capable of executing tasks through natural language, such as Copilot for coding or Amazon Rufus for shopping. Evaluating these systems is challenging, as their rapid evolution outpaces traditional human evaluation. Researchers have proposed LLM Agents to simulate participants as digital twins, but it remains unclear to what extent a digital twin can represent a specific customer in multi-turn interaction with an agentic AI system. In this paper, we recruited 40 human participants to shop with Amazon Rufus, collected their personas, interaction traces, and UX feedback, and then created digital twins to repeat the task. Pairwise comparison of human and digital-twin traces shows that while agents often explored more diverse choices, their action patterns aligned with humans and yielded similar design feedback. This study is the first to quantify how closely LLM agents can mirror human multi-turn interaction with an agentic AI system, highlighting their potential for scalable evaluation.
中文: 本研究对比了人类与数字孪生与亚马逊Rufus的交互,发现尽管LLM智能体探索了更多选项,但其行为模式与人类高度一致且反馈相似,证明了它们在代理AI系统可扩展评估方面的潜力。
English: This study compares human and digital twin interactions with Amazon Rufus, finding that while LLM agents explore more options, they align closely with human action patterns and feedback, demonstrating their potential for scalable evaluation of agentic AI systems.

Authors:Wei Wan, Yuxuan Ning, Zhicong Huang, Cheng Hong, Shengshan Hu, Ziqi Zhou, Yechao Zhang, Tianqing Zhu, Wanlei Zhou, Leo Yu Zhang
Title: MARS: A Malignity-Aware Backdoor Defense in Federated Learning
Abstract:
Federated Learning (FL) is a distributed paradigm aimed at protecting participant data privacy by exchanging model parameters to achieve high-quality model training. However, this distributed nature also makes FL highly vulnerable to backdoor attacks. Notably, the recently proposed state-of-the-art (SOTA) attack, 3DFed (SP2023), uses an indicator mechanism to determine whether the backdoor models have been accepted by the defender and adaptively optimizes backdoor models, rendering existing defenses ineffective. In this paper, we first reveal that the failure of existing defenses lies in the employment of empirical statistical measures that are loosely coupled with backdoor attacks. Motivated by this, we propose a Malignity-Aware backdooR defenSe (MARS) that leverages backdoor energy (BE) to indicate the malicious extent of each neuron. To amplify malignity, we further extract the most prominent BE values from each model to form a concentrated backdoor energy (CBE). Finally, a novel Wasserstein distance-based clustering method is introduced to effectively identify backdoor models. Extensive experiments demonstrate that MARS can defend against SOTA backdoor attacks and significantly outperforms existing defenses.
中文摘要:联邦学习易受3DFed等后门攻击,而提出的MARS防御通过后门能量和Wasserstein距离聚类有效识别并抵御这些威胁,显著优于现有防御方法。
English Summary: Federated Learning is vulnerable to backdoor attacks like 3DFed, but the proposed MARS defense uses backdoor energy and Wasserstein distance clustering to effectively identify and counter these threats, outperforming existing methods.

Authors:Bingsheng Yao, Jiaju Chen, Chaoran Chen, April Wang, Toby Jia-jun Li, Dakuo Wang
Title: Through the Lens of Human-Human Collaboration: A Configurable Research Platform for Exploring Human-Agent Collaboration
Abstract:
Intelligent systems have traditionally been designed as tools rather than collaborators, often lacking critical characteristics that collaboration partnerships require. Recent advances in large language model (LLM) agents open new opportunities for human-LLM-agent collaboration by enabling natural communication and various social and cognitive behaviors. Yet it remains unclear whether principles of computer-mediated collaboration established in HCI and CSCW persist, change, or fail when humans collaborate with LLM agents. To support systematic investigations of these questions, we introduce an open and configurable research platform for HCI researchers. The platform's modular design allows seamless adaptation of classic CSCW experiments and manipulation of theory-grounded interaction controls. We demonstrate the platform's effectiveness and usability through two case studies: (1) re-implementing the classic human-human-collaboration task Shape Factory as a between-subject human-agent-collaboration experiment with 16 participants, and (2) a participatory cognitive walkthrough with five HCI researchers to refine workflows and interfaces for experiment setup and analysis.
中文: 传统智能系统被设计为工具而非协作伙伴,但大型语言模型(LLM)智能体的发展为人类与智能体协作开辟了新途径,不过现有计算机中介协作原则在人类与LLM智能体协作中的适用性尚不明确;为此我们开发了一个可配置的HCI研究平台,并通过人机协作实验和参与式评估案例验证了其有效性。
English: Traditional intelligent systems have been designed as tools rather than collaborative partners, but recent advances in LLM agents offer new opportunities for human-agent collaboration, though it remains unclear how established computer-mediated collaboration principles apply; to investigate this, we introduce a configurable research platform for HCI researchers, demonstrated through case studies including a human-agent collaboration experiment and participatory evaluations.

Authors:Minfeng Qi, Tianqing Zhu, Lefeng Zhang, Ningran Li, Wanlei Zhou
Title: Towards Transparent and Incentive-Compatible Collaboration in Decentralized LLM Multi-Agent Systems: A Blockchain-Driven Approach
Abstract:
Large Language Models (LLMs) have enabled the emergence of autonomous agents capable of complex reasoning, planning, and interaction. However, coordinating such agents at scale remains a fundamental challenge, particularly in decentralized environments where communication lacks transparency and agent behavior cannot be shaped through centralized incentives. We propose a blockchain-based framework that enables transparent agent registration, verifiable task allocation, and dynamic reputation tracking through smart contracts. The core of our design lies in two mechanisms: a matching score-based task allocation protocol that evaluates agents by reputation, capability match, and workload; and a behavior-shaping incentive mechanism that adjusts agent behavior via feedback on performance and reward. Our implementation integrates GPT-4 agents with Solidity contracts and demonstrates, through 50-round simulations, strong task success rates, stable utility distribution, and emergent agent specialization. The results underscore the potential for trustworthy, incentive-compatible multi-agent coordination in open environments.
中文: 本文提出一种基于区块链的框架,通过智能合约实现自主智能体的透明协调,利用信誉任务分配和激励机制在去中心化环境中达成高任务成功率并涌现专业化分工。
English: This paper presents a blockchain framework that enables transparent coordination of autonomous agents through smart contracts, using reputation-based task allocation and incentive mechanisms to achieve high success rates and emergent specialization in decentralized environments.

Authors:Jingyu Tang, Chaoran Chen, Jiawen Li, Zhiping Zhang, Bingcan Guo, Ibrahim Khalilov, Simret Araya Gebreegziabher, Bingsheng Yao, Dakuo Wang, Yanfang Ye, Tianshi Li, Ziang Xiao, Yaxing Yao, Toby Jia-Jun Li
Title: Dark Patterns Meet GUI Agents: LLM Agent Susceptibility to Manipulative Interfaces and the Role of Human Oversight
Abstract:
The dark patterns, deceptive interface designs manipulating user behaviors, have been extensively studied for their effects on human decision-making and autonomy. Yet, with the rising prominence of LLM-powered GUI agents that automate tasks from high-level intents, understanding how dark patterns affect agents is increasingly important. We present a two-phase empirical study examining how agents, human participants, and human-AI teams respond to 16 types of dark patterns across diverse scenarios. Phase 1 highlights that agents often fail to recognize dark patterns, and even when aware, prioritize task completion over protective action. Phase 2 revealed divergent failure modes: humans succumb due to cognitive shortcuts and habitual compliance, while agents falter from procedural blind spots. Human oversight improved avoidance but introduced costs such as attentional tunneling and cognitive load. Our findings show neither humans nor agents are uniformly resilient, and collaboration introduces new vulnerabilities, suggesting design needs for transparency, adjustable autonomy, and oversight.
中文摘要:暗黑模式对人类和AI代理构成不同风险,人类因认知偏见而受骗,AI代理因程序盲点而忽视欺骗性设计,人机协作反而引入新漏洞,需通过透明设计和可调节监管来应对。
English Summary: Dark patterns pose distinct risks to both humans and AI agents, with humans falling prey to cognitive biases while agents overlook deceptive designs due to procedural gaps, and human-AI collaboration introduces new vulnerabilities requiring transparent and adjustable oversight.

Authors:Faqian Guan, Tianqing Zhu, Zhoutian Wang, Wei Ren, Wanlei Zhou
Title: Graph Unlearning: Efficient Node Removal in Graph Neural Networks
Abstract:
With increasing concerns about privacy attacks and potential sensitive information leakage, researchers have actively explored methods to efficiently remove sensitive training data and reduce privacy risks in graph neural network (GNN) models. Node unlearning has emerged as a promising technique for protecting the privacy of sensitive nodes by efficiently removing specific training node information from GNN models. However, existing node unlearning methods either impose restrictions on the GNN structure or do not effectively utilize the graph topology for node unlearning. Some methods even compromise the graph's topology, making it challenging to achieve a satisfactory performance-complexity trade-off. To address these issues and achieve efficient unlearning for training node removal in GNNs, we propose three novel node unlearning methods: Class-based Label Replacement, Topology-guided Neighbor Mean Posterior Probability, and Class-consistent Neighbor Node Filtering. Among these methods, Topology-guided Neighbor Mean Posterior Probability and Class-consistent Neighbor Node Filtering effectively leverage the topological features of the graph, resulting in more effective node unlearning. To validate the superiority of our proposed methods in node unlearning, we conducted experiments on three benchmark datasets. The evaluation criteria included model utility, unlearning utility, and unlearning efficiency. The experimental results demonstrate the utility and efficiency of the proposed methods and illustrate their superiority compared to state-of-the-art node unlearning methods. Overall, the proposed methods efficiently remove sensitive training nodes and protect the privacy information of sensitive nodes in GNNs. The findings contribute to enhancing the privacy and security of GNN models and provide valuable insights into the field of node unlearning.
中文: 本研究针对图神经网络中的隐私风险,提出了三种高效的节点遗忘方法,通过有效利用图拓扑结构来移除敏感训练数据,实验证明其在效用和效率上均优于现有技术。
English: To address privacy risks in graph neural networks, this study introduces three efficient node unlearning methods that effectively remove sensitive training data while leveraging graph topology, demonstrating superior performance in utility and efficiency compared to existing approaches.

Authors:Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Title: Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward
Abstract:
Recent advances in large reasoning models have leveraged reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities. However, scaling these methods typically requires extensive rollout computation and large datasets, leading to high training costs and low data efficiency. To mitigate this issue, we propose DEPO, a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential, thereby reducing substantial rollout computational costs. Furthermore, we incorporate a replay mechanism for under-explored samples to ensure adequate training, which enhances the model's final convergence performance. Experiments across five reasoning benchmarks show that DEPO consistently outperforms existing methods in both offline and online data selection scenarios. Notably, using only 20% of the training data, our approach achieves a 1.85 times speed-up on AIME24 and a 1.66 times speed-up on AIME25 compared to GRPO trained on the full dataset.
Chinese: DEPO提出了一种数据高效策略优化流程,通过在离线阶段精选高质量训练样本并在在线强化学习验证奖励训练中动态筛选样本,以更低的计算成本和训练数据显著提升了推理模型的性能。
English: DEPO introduces a data-efficient policy optimization pipeline that enhances reasoning models by strategically selecting high-quality offline data and dynamically filtering online samples during RLVR training, achieving superior performance with significantly reduced computational costs and training data.

Authors:Chunyang Jiang, Yonggang Zhang, Yiyang Cai, Chi-Min Chan, Yulong Liu, Mingming Chen, Wei Xue, Yike Guo
Title: Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks
Abstract:
The rising cost of acquiring supervised data has driven significant interest in self-improvement for large language models (LLMs). Straightforward unsupervised signals like majority voting have proven effective in generating pseudo-labels for verifiable tasks, while their applicability to unverifiable tasks (e.g., translation) is limited by the open-ended character of responses. As a result, self-evaluation mechanisms (e.g., self-judging and entropy minimization) are predominantly used to derive pseudo-labels. However, self-evaluation relying on LLMs typically incurs high computational overhead and introduces overconfidence issues due to intrinsic biases. To address these challenges, we propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement. Inspired by majority voting commonly employed in verifiable tasks, we propose semantic voting as a novel mechanism that relaxes the principle of hard matching (i.e., exact matching) toward soft matching (i.e., semantic similarity). Soft matching is achieved by leveraging a lightweight sentence embedding model to quantify semantic similarity, thereby mitigating excessive computational burden and intrinsic bias-associated limitations of self-evaluation. Comprehensive experiments demonstrate that our method achieves substantial gains in computational efficiency and overall better performance than self-evaluation methods across diverse model architectures and tasks.
Chinese: 针对不可验证任务中自评估方法的高计算成本和过度自信问题,本研究提出了一种无需自评估的语义投票机制,通过轻量级嵌入模型量化语义相似性替代精确匹配,显著提升了计算效率和整体性能。
English: To address the computational overhead and overconfidence issues of self-evaluation in LLMs for unverifiable tasks, this study introduces a self-evaluation-free approach using semantic voting, which replaces exact matching with semantic similarity measured by a lightweight embedding model, achieving enhanced efficiency and performance.

Authors:Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
Title: WoW: Towards a World omniscient World model Through Embodied Interaction
Abstract:
Humans develop an understanding of intuitive physics through active interaction with the world. This approach is in stark contrast to current video models, such as Sora, which rely on passive observation and therefore struggle with grasping physical causality. This observation leads to our central hypothesis: authentic physical intuition of the world model must be grounded in extensive, causally rich interactions with the real world. To test this hypothesis, we present WoW, a 14-billion-parameter generative world model trained on 2 million robot interaction trajectories. Our findings reveal that the model's understanding of physics is a probabilistic distribution of plausible outcomes, leading to stochastic instabilities and physical hallucinations. Furthermore, we demonstrate that this emergent capability can be actively constrained toward physical realism by SOPHIA, where vision-language model agents evaluate the DiT-generated output and guide its refinement by iteratively evolving the language instructions. In addition, a co-trained Inverse Dynamics Model translates these refined plans into executable robotic actions, thus closing the imagination-to-action loop. We establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluation, demonstrating strong ability in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is a cornerstone for developing physical intuition in AI. Models, data, and benchmarks will be open-sourced.
中文: 人类通过主动互动发展直观物理认知,而如Sora等被动视频模型难以把握因果关系,因此我们开发了基于机器人交互训练的WoW模型,它展现出概率性物理理解能力,并在物理推理基准测试中取得领先表现。
English: Humans learn intuitive physics through active interaction, unlike passive video models like Sora, which struggle with causality, leading to the WoW model trained on robot interactions that demonstrates probabilistic physics understanding and achieves top performance in physical reasoning benchmarks.

Authors:Enguang Liu, Siyuan Liang, Liming Lu, Xiyu Zeng, Xiaochun Cao, Aishan Liu, Shuchao Pang
Title: RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation
Abstract:
The safety and reliability of embodied agents rely on accurate and unbiased visual perception. However, existing benchmarks mainly emphasize generalization and robustness under perturbations, while systematic quantification of visual bias remains scarce. This gap limits a deeper understanding of how perception influences decision-making stability. To address this issue, we propose RoboView-Bias, the first benchmark specifically designed to systematically quantify visual bias in robotic manipulation, following a principle of factor isolation. Leveraging a structured variant-generation framework and a perceptual-fairness validation protocol, we create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Using this benchmark, we systematically evaluate three representative embodied agents across two prevailing paradigms and report three key findings: (i) all agents exhibit significant visual biases, with camera viewpoint being the most critical factor; (ii) agents achieve their highest success rates on highly saturated colors, indicating inherited visual preferences from underlying VLMs; and (iii) visual biases show strong, asymmetric coupling, with viewpoint strongly amplifying color-related bias. Finally, we demonstrate that a mitigation strategy based on a semantic grounding layer substantially reduces visual bias by approximately 54.5\% on MOKA. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.
中文摘要:该研究提出了首个系统性量化机器人操作中视觉偏差的基准RoboView-Bias,揭示了具身智能体存在显著视觉偏差,并通过语义基础层缓解策略将偏差降低了54.5%。
English Summary: The study introduces RoboView-Bias, a pioneering benchmark to systematically quantify visual bias in robotic manipulation, revealing significant biases in embodied agents and demonstrating a mitigation strategy that reduces bias by 54.5%.

Authors:Sitong Cheng, Weizhen Bian, Xinsheng Wang, Ruibin Yuan, Jianyi Chen, Shunshun Yin, Yike Guo, Wei Xue
Title: UniSS: Unified Expressive Speech-to-Speech Translation with Your Voice
Abstract:
The ultimate goal of expressive speech-to-speech translation (S2ST) is to accurately translate spoken content while preserving the speaker identity and emotional style. However, progress in this field is largely hindered by three key challenges: the scarcity of paired speech data that retains expressive styles, the complexity of multi-stage processing pipelines, and the limited transfer of translation capabilities from large language models (LLMs). In this work, we address these challenges by introducing UniSS, a novel single-stage framework for expressive S2ST. Our approach features carefully designed speech semantic and style modeling, enabling seamless integration with existing text-based LLM frameworks to develop a unified text-speech language model. To transfer translation capabilities from text to speech, we propose a cross-modal chain-of-thought prompting process that progressively aligns audio semantics with text and ensures style preservation in the decoded results. Furthermore, we construct and release a large-scale, high-quality expressive S2ST dataset, UniST, comprising 44.8k hours of data. Experimental results show that UniSS significantly outperforms previous methods in translation fidelity and speech quality while preserving voice, emotion, and duration consistency. Our work establishes a simpler and more effective paradigm for building the next generation of expressive S2ST systems. Audio samples are available at https://cmots.github.io/uniss-demo.
中文摘要:UniSS框架通过精心设计的语音语义与风格建模,结合文本大语言模型提出单阶段语音翻译方案,利用跨模态思维链提示和发布的大规模数据集解决了数据稀缺与流程复杂性问题,在保持音色情感一致性的同时显著提升了翻译准确度与语音质量。
English Summary: The UniSS framework introduces a single-stage approach for expressive speech-to-speech translation by integrating speech semantic and style modeling with text-based LLMs, overcoming data scarcity and pipeline complexity through cross-modal alignment and a newly released 44.8k-hour dataset, achieving superior translation fidelity and style preservation.

Authors:Dehong Kong, Sifan Yu, Siyuan Liang, Jiawei Liang, Jianhou Gan, Aishan Liu, Wenqi Ren
Title: Universal Camouflage Attack on Vision-Language Models for Autonomous Driving
Abstract:
Visual language modeling for automated driving is emerging as a promising research direction with substantial improvements in multimodal reasoning capabilities. Despite its advanced reasoning abilities, VLM-AD remains vulnerable to serious security threats from adversarial attacks, which involve misleading model decisions through carefully crafted perturbations. Existing attacks have obvious challenges: 1) Physical adversarial attacks primarily target vision modules. They are difficult to directly transfer to VLM-AD systems because they typically attack low-level perceptual components. 2) Adversarial attacks against VLM-AD have largely concentrated on the digital level. To address these challenges, we propose the first Universal Camouflage Attack (UCA) framework for VLM-AD. Unlike previous methods that focus on optimizing the logit layer, UCA operates in the feature space to generate physically realizable camouflage textures that exhibit strong generalization across different user commands and model architectures. Motivated by the observed vulnerability of encoder and projection layers in VLM-AD, UCA introduces a feature divergence loss (FDL) that maximizes the representational discrepancy between clean and adversarial images. In addition, UCA incorporates a multi-scale learning strategy and adjusts the sampling ratio to enhance its adaptability to changes in scale and viewpoint diversity in real-world scenarios, thereby improving training stability. Extensive experiments demonstrate that UCA can induce incorrect driving commands across various VLM-AD models and driving scenarios, significantly surpassing existing state-of-the-art attack methods (improving 30\% in 3-P metrics). Furthermore, UCA exhibits strong attack robustness under diverse viewpoints and dynamic conditions, indicating high potential for practical deployment.
中文: 针对自动驾驶的视觉语言建模面临严重的安全威胁,因此提出了通用伪装攻击框架,通过特征空间操作有效误导模型,并在多种场景下展现出卓越的攻击性能。
English: Visual language modeling for automated driving faces significant security threats from adversarial attacks, leading to the development of the Universal Camouflage Attack framework, which effectively misleads models through feature space manipulation and demonstrates superior performance across various scenarios.

Authors:Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Title: Thinking Augmented Pre-training
Abstract:
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
中文: 本文提出思维增强预训练(TPT)方法,通过自动生成的思维轨迹增强文本数据,显著提升大语言模型的数据利用效率和训练效果,在不同规模模型上均取得性能突破。
English: This paper presents Thinking augmented Pre-Training (TPT), a method that enhances large language model training efficiency by augmenting text data with automatically generated thinking trajectories, which improves data utility and model performance across various scales.

Authors:Liang Wang, Nan Yang, Shaohan Huang, Li Dong, Furu Wei
Title: Thinking Augmented Pre-training
Abstract:
This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to $100$B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of $3$. For a $3$B parameter model, it improves the post-training performance by over $10\%$ on several challenging reasoning benchmarks.
中文: 本文提出思维增强预训练(TPT)方法,通过自动生成的思维轨迹增强文本数据,显著提升大语言模型的数据利用效率和训练效果,在不同规模模型上均取得性能突破。
English: This paper presents Thinking augmented Pre-Training (TPT), a method that enhances large language model training efficiency by augmenting text data with automatically generated thinking trajectories, which improves data utility and model performance across various scales.

Authors:Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
Title: Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
Abstract:
Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigated the phenomenon known as m-sharpness, where the performance of SAM improves monotonically as the micro-batch size for computing perturbations decreases. Leveraging an extended Stochastic Differential Equation (SDE) framework, combined with an analysis of the structure of stochastic gradient noise (SGN), we precisely characterize the dynamics of various SAM variants. Our findings reveal that the stochastic noise introduced during SAM perturbations inherently induces a variance-based sharpness regularization effect. Motivated by our theoretical insights, we introduce Reweighted SAM, which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate the effectiveness of our theoretical analysis and proposed method.
中文: 锐度感知最小化(SAM)通过减小微批次大小引入随机噪声,实现基于方差的锐度正则化以提升泛化能力,由此提出的重加权SAM方法在保持可并行性的同时复现了这些优势。
English: Sharpness-aware minimization (SAM) improves generalization by leveraging stochastic noise from reduced micro-batch sizes to induce variance-based sharpness regularization, leading to the development of Reweighted SAM that mimics these benefits while maintaining parallelizability.

Authors:William Chen, Chutong Meng, Jiatong Shi, Martijn Bartelds, Shih-Heng Wang, Hsiu-Hsuan Wang, Rafael Mosquera, Sara Hincapie, Dan Jurafsky, Antonis Anastasopoulos, Hung-yi Lee, Karen Livescu, Shinji Watanabe
Title: The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties
Abstract:
Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties. To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The challenge also introduces an online evaluation server based on DynaBench, allowing for flexibility in model design and architecture for participants. The challenge received 5 submissions from 3 teams, all of which outperformed our baselines. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18% when compared to the best baseline on a general multilingual test set. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy, showing the importance of community challenges in making speech technologies more inclusive.
中文:Interspeech 2025 ML-SUPERB 2.0挑战赛构建了涵盖200多种语言与方言的测试集,最佳提交方案在方言数据上实现了30.2%的字错误率降低和23%的语言识别准确率提升,推动了包容性语音技术的发展。
English: The Interspeech 2025 ML-SUPERB 2.0 Challenge introduced a comprehensive test suite covering over 200 languages and dialects, where top submissions significantly outperformed baselines with up to 30.2% lower CER and 23% higher LID accuracy, demonstrating progress toward inclusive speech technology.

Authors:Reshma Prasad, Maxime Elkael, Gabriele Gemmi, Osama M. Bushnaq, Debashisha Mishra, Prasanna Raut, Jennifer Simonjan, Michele Polese, Tommaso Melodia
Title: Joint Routing, Resource Allocation, and Energy Optimization for Integrated Access and Backhaul with Open RAN
Abstract:
As networks evolve towards 6G, Mobile Network Operators (MNOs) must accommodate diverse requirements and at the same time manage rising energy consumption. Integrated Access and Backhaul (IAB) networks facilitate dense cellular deployments with reduced infrastructure complexity. However, the multi-hop wireless backhauling in IAB networks necessitates proper routing and resource allocation decisions to meet the performance requirements. At the same time, cell densification makes energy optimization crucial. This paper addresses the joint optimization of routing and resource allocation in IAB networks through two distinct objectives: energy minimization and throughput maximization. We develop a novel capacity model that links power levels to achievable data rates. We propose two practical large-scale approaches to solve the optimization problems and leverage the closed-loop control framework introduced by the Open Radio Access Network (O-RAN) architecture to integrate the solutions. The approaches are evaluated on diverse scenarios built upon open data of two months of traffic collected by network operators in the city of Milan, Italy. Results show that the proposed approaches effectively reduces number of activated nodes to save energy and achieves approximately 100 Mbps of minimum data rate per User Equipment (UE) during peak hours of the day using spectrum within the Frequency Range (FR) 3, or upper midband. The results validate the practical applicability of our framework for next-generation IAB network deployment and optimization.
中文摘要:本文针对6G集成接入与回传网络提出联合路由和资源分配优化方法,通过真实流量数据验证表明,该方法能有效节约能源并保证用户最低数据速率。
English Summary: This paper proposes joint routing and resource allocation optimization methods for 6G IAB networks to achieve energy minimization and throughput maximization, validated through real traffic data showing effective energy savings and guaranteed user data rates.

Authors:Zeren Xiong, Zikun Chen, Zedong Zhang, Xiang Li, Ying Tai, Jian Yang, Jun Li
Title: Category-Aware 3D Object Composition with Disentangled Texture and Shape Multi-view Diffusion
Abstract:
In this paper, we tackle a new task of 3D object synthesis, where a 3D model is composited with another object category to create a novel 3D model. However, most existing text/image/3D-to-3D methods struggle to effectively integrate multiple content sources, often resulting in inconsistent textures and inaccurate shapes. To overcome these challenges, we propose a straightforward yet powerful approach, category+3D-to-3D (C33D), for generating novel and structurally coherent 3D models. Our method begins by rendering multi-view images and normal maps from the input 3D model, then generating a novel 2D object using adaptive text-image harmony (ATIH) with the front-view image and a text description from another object category as inputs. To ensure texture consistency, we introduce texture multi-view diffusion, which refines the textures of the remaining multi-view RGB images based on the novel 2D object. For enhanced shape accuracy, we propose shape multi-view diffusion to improve the 2D shapes of both the multi-view RGB images and the normal maps, also conditioned on the novel 2D object. Finally, these outputs are used to reconstruct a complete and novel 3D model. Extensive experiments demonstrate the effectiveness of our method, yielding impressive 3D creations, such as shark(3D)-crocodile(text) in the first row of Fig. 1. A project page is available at: https://xzr52.github.io/C33D/
中文: 本文提出C33D方法,通过自适应文本图像协调和多视角扩散技术,将现有3D模型与其他类别对象结合生成新颖3D模型,有效保持纹理一致性和形状精确度。
English: This paper introduces C33D, a novel method for synthesizing 3D models by combining existing 3D objects with new categories through adaptive text-image harmony and multi-view diffusion to ensure texture consistency and shape accuracy.

Authors:Haibo Tong, Dongcheng Zhao, Guobin Shen, Xiang He, Dachuan Lin, Feifei Zhao, Yi Zeng
Title: Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks
Abstract:
The remarkable capabilities of Large Language Models (LLMs) have raised significant safety concerns, particularly regarding "jailbreak" attacks that exploit adversarial prompts to bypass safety alignment mechanisms. Existing defense research primarily focuses on single-turn attacks, whereas multi-turn jailbreak attacks progressively break through safeguards through by concealing malicious intent and tactical manipulation, ultimately rendering conventional single-turn defenses ineffective. To address this critical challenge, we propose the Bidirectional Intention Inference Defense (BIID). The method integrates forward request-based intention inference with backward response-based intention retrospection, establishing a bidirectional synergy mechanism to detect risks concealed within seemingly benign inputs, thereby constructing a more robust guardrails that effectively prevents harmful content generation. The proposed method undergoes systematic evaluation compared with a no-defense baseline and seven representative defense methods across three LLMs and two safety benchmarks under 10 different attack methods. Experimental results demonstrate that the proposed method significantly reduces the Attack Success Rate (ASR) across both single-turn and multi-turn jailbreak attempts, outperforming all existing baseline methods while effectively maintaining practical utility. Notably, comparative experiments across three multi-turn safety datasets further validate the proposed model's significant advantages over other defense approaches.
Chinese Summary: 提出的双向意图推理防御(BIID)通过整合前向意图推断与后向意图回溯,有效应对大语言模型的单轮和多轮越狱攻击,显著降低攻击成功率的同时保持模型实用性。
English Summary: The proposed Bidirectional Intention Inference Defense (BIID) effectively counters both single-turn and multi-turn jailbreak attacks on Large Language Models by integrating forward intention inference with backward intention retrospection, significantly reducing attack success rates while maintaining model utility.

Authors:Yao Liang, Dongcheng Zhao, Feifei Zhao, Guobin Shen, Yuwei Wang, Dongqi Liang, Yi Zeng
Title: MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values
Abstract:
The alignment of large language models (LLMs) with human values is critical for their safe and effective deployment across diverse user populations. However, existing benchmarks often neglect cultural and demographic diversity, leading to limited understanding of how value alignment generalizes globally. In this work, we introduce MVPBench, a novel benchmark that systematically evaluates LLMs' alignment with multi-dimensional human value preferences across 75 countries. MVPBench contains 24,020 high-quality instances annotated with fine-grained value labels, personalized questions, and rich demographic metadata, making it the most comprehensive resource of its kind to date. Using MVPBench, we conduct an in-depth analysis of several state-of-the-art LLMs, revealing substantial disparities in alignment performance across geographic and demographic lines. We further demonstrate that lightweight fine-tuning methods, such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization (DPO), can significantly enhance value alignment in both in-domain and out-of-domain settings. Our findings underscore the necessity for population-aware alignment evaluation and provide actionable insights for building culturally adaptive and value-sensitive LLMs. MVPBench serves as a practical foundation for future research on global alignment, personalized value modeling, and equitable AI development.
中文: MVPBench作为首个涵盖75个国家多维人类价值观的综合性基准,揭示了大型语言模型在跨地域和人口统计中存在显著对齐差异,并通过轻量级微调方法证明了提升全球价值对齐的有效性。
English: MVPBench is introduced as a comprehensive benchmark to evaluate large language models' alignment with diverse human values across 75 countries, revealing significant performance disparities and demonstrating that lightweight fine-tuning methods can effectively enhance value alignment globally.

Authors:Zhuoyang Liu, Jiaming Liu, Jiadong Xu, Nuowei Han, Chenyang Gu, Hao Chen, Kaichen Zhou, Renrui Zhang, Kai Chin Hsieh, Kun Wu, Zhengping Che, Jian Tang, Shanghang Zhang
Title: MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
Abstract:
Vision-language-action models (VLAs) have shown generalization capabilities in robotic manipulation tasks by inheriting from vision-language models (VLMs) and learning action generation. Most VLA models focus on interpreting vision and language to generate actions, whereas robots must perceive and interact within the spatial-physical world. This gap highlights the need for a comprehensive understanding of robotic-specific multisensory information, which is crucial for achieving complex and contact-rich control. To this end, we introduce a multisensory language-action (MLA) model that collaboratively perceives heterogeneous sensory modalities and predicts future multisensory objectives to facilitate physical world modeling. Specifically, to enhance perceptual representations, we propose an encoder-free multimodal alignment scheme that innovatively repurposes the large language model itself as a perception module, directly interpreting multimodal cues by aligning 2D images, 3D point clouds, and tactile tokens through positional correspondence. To further enhance MLA's understanding of physical dynamics, we design a future multisensory generation post-training strategy that enables MLA to reason about semantic, geometric, and interaction information, providing more robust conditions for action generation. For evaluation, the MLA model outperforms the previous state-of-the-art 2D and 3D VLA methods by 12% and 24% in complex, contact-rich real-world tasks, respectively, while also demonstrating improved generalization to unseen configurations. Project website: https://sites.google.com/view/open-mla
中文: MLA模型通过整合多感官感知并预测未来感知目标,在复杂任务中显著提升了机器人操作的性能。
English: The MLA model enhances robotic manipulation by integrating multisensory perception and predicting future sensory objectives, achieving significant performance improvements in complex tasks.

Authors:Junjie Wen, Minjie Zhu, Jiaming Liu, Zhiyuan Liu, Yicun Yang, Linfeng Zhang, Shanghang Zhang, Yichen Zhu, Yi Xu
Title: dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
Abstract:
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that leverages a multimodal chain-of-thought to unify visual perception, language reasoning, and robotic control in a single system. dVLA jointly optimizes perception, language understanding, and action under a single diffusion objective, enabling stronger cross-modal reasoning and better generalization to novel instructions and objects. For practical deployment, we mitigate inference latency by incorporating two acceleration strategies, a prefix attention mask and KV caching, yielding up to around times speedup at test-time inference. We evaluate dVLA in both simulation and the real world: on the LIBERO benchmark, it achieves state-of-the-art performance with a 96.4% average success rate, consistently surpassing both discrete and continuous action policies; on a real Franka robot, it succeeds across a diverse task suite, including a challenging bin-picking task that requires multi-step planning, demonstrating robust real-world performance. Together, these results underscore the promise of unified diffusion frameworks for practical, high-performance VLA robotics.
Chinese: dVLA是一种基于扩散的视觉-语言-动作统一模型,通过单一目标整合感知、推理和控制,在仿真和实际任务中实现最优性能,并具备加速推理能力。
English: dVLA is a unified diffusion-based Vision-Language-Action model that integrates perception, reasoning, and control under a single objective, achieving state-of-the-art performance in simulation and real-world tasks with accelerated inference.

Authors:Yingjie Zhu, Xuefeng Bai, Kehai Chen, Yang Xiang, Weili Guan, Jun Yu, Min Zhang
Title: From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs
Abstract:
Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we present a systematic study of the spatial bias of LVLMs, focusing on how models respond when identical key visual information is placed at different locations within an image. Through a carefully designed probing dataset, we demonstrate that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a fundamental limitation in their spatial-semantic understanding. Further analysis shows that this phenomenon originates not from the vision encoder, which reliably perceives and interprets visual content across positions, but from the unbalanced design of position embeddings in the language model component. In particular, the widely adopted position embedding strategies, such as RoPE, introduce imbalance during cross-modal interaction, leading image tokens at different positions to exert unequal influence on semantic understanding. To mitigate this issue, we introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens, promoting a more balanced integration of visual information. Extensive experiments show that BaPA enhances the spatial robustness of LVLMs without retraining and further boosts their performance across diverse multimodal benchmarks when combined with lightweight fine-tuning. Further analysis of information flow reveals that BaPA yields balanced attention, enabling more holistic visual understanding.
中文: 大型视觉语言模型因位置嵌入不平衡而产生空间偏差,而提出的平衡位置分配方法有效缓解了这一问题,增强了空间鲁棒性并提升了多模态任务性能。
English: Large Vision-Language Models exhibit spatial bias due to imbalanced position embeddings, which the proposed Balanced Position Assignment method effectively mitigates to enhance spatial robustness and performance across multimodal tasks.

Authors:Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang
Title: Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment
Abstract:
Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for further advancing global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant evaluation datasets. To fill this gap, we propose Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness and demonstrates a positive correlation between performance on CARB and downstream multilingual cultural alignment tasks. Further analysis identifies the spurious correlations within culture-aware reward modeling, wherein RM's scoring relies predominantly on surface-level features rather than authentic cultural nuance understanding. To address these, we propose Think-as-Locals to elicit deeper culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR) and employ well-designed rewards to ensure accurate preference judgments and high-quality structured evaluation criteria generation. Experimental results validate its efficacy in mitigating spurious features interference and advancing culture-aware reward modeling.
中文摘要:奖励模型对于将大型语言模型与多元文化对齐至关重要,但现有评估缺乏文化意识基准,因此提出CARB基准,通过可验证奖励的强化学习等方法评估并提升文化理解能力。
English Summary: Reward models are essential for aligning large language models with diverse cultures, yet current evaluations lack cultural awareness benchmarks, prompting the development of CARB to assess and improve cultural understanding through methods like reinforcement learning from verifiable rewards.

Authors:Rui Xu, Tianyang Xue, Qiujie Dong, Le Wan, Zhe Zhu, Peng Li, Zhiyang Dou, Cheng Lin, Shiqing Xin, Yuan Liu, Wenping Wang, Taku Komura
Title: MeshMosaic: Scaling Artist Mesh Generation via Local-to-Global Assembly
Abstract:
Scaling artist-designed meshes to high triangle numbers remains challenging for autoregressive generative models. Existing transformer-based methods suffer from long-sequence bottlenecks and limited quantization resolution, primarily due to the large number of tokens required and constrained quantization granularity. These issues prevent faithful reproduction of fine geometric details and structured density patterns. We introduce MeshMosaic, a novel local-to-global framework for artist mesh generation that scales to over 100K triangles--substantially surpassing prior methods, which typically handle only around 8K faces. MeshMosaic first segments shapes into patches, generating each patch autoregressively and leveraging shared boundary conditions to promote coherence, symmetry, and seamless connectivity between neighboring regions. This strategy enhances scalability to high-resolution meshes by quantizing patches individually, resulting in more symmetrical and organized mesh density and structure. Extensive experiments across multiple public datasets demonstrate that MeshMosaic significantly outperforms state-of-the-art methods in both geometric fidelity and user preference, supporting superior detail representation and practical mesh generation for real-world applications.
中文: MeshMosaic提出了一种从局部到全局的创新框架,通过自回归生成连贯的网格块,将艺术家设计的网格扩展到超过10万个三角形,在几何保真度和用户偏好上显著优于现有方法。
English: MeshMosaic introduces a local-to-global framework that scales artist-designed meshes to over 100K triangles by generating coherent patches autoregressively, significantly outperforming existing methods in geometric fidelity and user preference.

Authors:Zhouxiang Zhao, Ran Yi, Yihan Cang, Boyang Jin, Zhaohui Yang, Mingzhe Chen, Chongwen Huang, Zhaoyang Zhang
Title: Agentic AI for Low-Altitude Semantic Wireless Networks: An Energy Efficient Design
Abstract:
This letter addresses the energy efficiency issue in unmanned aerial vehicle (UAV)-assisted autonomous systems. We propose a framework for an agentic artificial intelligence (AI)-powered low-altitude semantic wireless network, that intelligently orchestrates a sense-communicate-decide-control workflow. A system-wide energy consumption minimization problem is formulated to enhance mission endurance. This problem holistically optimizes key operational variables, including UAV's location, semantic compression ratio, transmit power of the UAV and a mobile base station, and binary decision for AI inference task offloading, under stringent latency and quality-of-service constraints. To tackle the formulated mixed-integer non-convex problem, we develop a low-complexity algorithm which can obtain the globally optimal solution with two-dimensional search. Simulation results validate the effectiveness of our proposed design, demonstrating significant reductions in total energy consumption compared to conventional baseline approaches.
本信提出了一种基于智能体AI的无人机辅助自主系统能效框架,通过低复杂度算法优化关键操作变量,显著降低了系统总能耗。
This letter proposes an energy-efficient framework for UAV-assisted autonomous systems using agentic AI to optimize operational variables and minimize energy consumption through a low-complexity algorithm.

Authors:Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
Title: Training-Free Pyramid Token Pruning for Efficient Large Vision-Language Models via Region, Token, and Instruction-Guided Importance
Abstract:
Large Vision-Language Models (LVLMs) have significantly advanced multimodal understanding but still struggle with efficiently processing high-resolution images. Recent approaches partition high-resolution images into multiple sub-images, dramatically increasing the number of visual tokens and causing exponential computational overhead during inference. To address these limitations, we propose a training-free token pruning strategy, Pyramid Token Pruning (PTP), that integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided importance. Inspired by human visual attention mechanisms, PTP selectively retains more tokens from visually salient regions and further leverages textual instructions to pinpoint tokens most relevant to specific multimodal tasks. Extensive experiments across 13 diverse benchmarks demonstrate that our method substantially reduces computational overhead and inference latency with minimal performance loss.
中文摘要:PTP是一种无需训练的策略,通过结合自下而上的视觉显著性和自上而下的任务指令,在保持多模态模型性能的同时,显著降低了高分辨率图像处理的计算开销和推理延迟。
English Summary: PTP is a training-free method that reduces computational costs in LVLMs by selectively preserving tokens from salient image regions based on visual saliency and task instructions, achieving significant efficiency gains with minimal performance loss.

Authors:Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
Title: Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance
Abstract:
Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images into multiple sub-images for separate encoding, but this approach drastically inflates the number of visual tokens and introduces prohibitive inference overhead. To overcome this challenge, we propose Pyramid Token Pruning (PTP), a training-free strategy that hierarchically integrates bottom-up visual saliency at both region and token levels with top-down instruction-guided relevance. Inspired by human visual cognition, PTP selectively preserves more tokens from salient regions while further emphasizing those most relevant to task instructions. Extensive experiments on 13 diverse benchmarks show that PTP substantially reduces computational cost, memory usage, and inference latency, with negligible performance degradation.
中文摘要:PTP是一种无需训练的策略,通过结合自下而上的视觉显著性和自上而下的任务指令,在保持多模态模型性能的同时,显著降低了高分辨率图像处理的计算开销和推理延迟。
English Summary: PTP is a training-free method that reduces computational costs in LVLMs by selectively preserving tokens from salient image regions based on visual saliency and task instructions, achieving significant efficiency gains with minimal performance loss.

Authors:Rongyu Zhang, Jiaming Liu, Xiaoqi Li, Xiaowei Chi, Dan Wang, Li Du, Yuan Du, Shanghang Zhang
Title: BEVUDA++: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection
Abstract:
Vision-centric Bird's Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9\% NDS and 9.5\% mAP enhancement on Day-Night adaptation.
中文总结:本文针对自动驾驶中鸟瞰图感知的域适应问题,提出BEVUDA++框架,通过可靠深度教师模型和几何一致性学生模型的协同设计,在多几何空间中减少域偏移,在跨域三维目标检测任务中实现了最先进的性能。
English Summary: This paper addresses the domain shift problem in vision-centric Bird's Eye View perception for autonomous driving by introducing BEVUDA++, a geometric-aware teacher-student framework that reduces domain gaps across multiple geometric spaces through reliable depth estimation and unified geometric embedding.

Authors:Rongyu Zhang, Xize Duan, Jiaming Liu, Li Du, Yuan Du, Dan Wang, Shanghang Zhang, Fangxin Wang
Title: RepCaM++: Exploring Transparent Visual Prompt With Inference-Time Re-Parameterization for Neural Video Delivery
Abstract:
Recently, content-aware methods have been employed to reduce bandwidth and enhance the quality of Internet video delivery. These methods involve training distinct content-aware super-resolution (SR) models for each video chunk on the server, subsequently streaming the low-resolution (LR) video chunks with the SR models to the client. Prior research has incorporated additional partial parameters to customize the models for individual video chunks. However, this leads to parameter accumulation and can fail to adapt appropriately as video lengths increase, resulting in increased delivery costs and reduced performance. In this paper, we introduce RepCaM++, an innovative framework based on a novel Re-parameterization Content-aware Modulation (RepCaM) module that uniformly modulates video chunks. The RepCaM framework integrates extra parallel-cascade parameters during training to accommodate multiple chunks, subsequently eliminating these additional parameters through re-parameterization during inference. Furthermore, to enhance RepCaM's performance, we propose the Transparent Visual Prompt (TVP), which includes a minimal set of zero-initialized image-level parameters (e.g., less than 0.1%) to capture fine details within video chunks. We conduct extensive experiments on the VSD4K dataset, encompassing six different video scenes, and achieve state-of-the-art results in video restoration quality and delivery bandwidth compression.
中文: 本文提出RepCaM++框架,通过重参数化技术和透明视觉提示模块,在降低视频传输带宽的同时提升画质,在VSD4K数据集上取得了最优性能。
English: This paper introduces RepCaM++, a framework using re-parameterization and transparent visual prompts to enhance video delivery by reducing bandwidth while maintaining quality, achieving state-of-the-art results on the VSD4K dataset.

Authors:Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
Title: HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models
Abstract:
By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
中文: 高分辨率大视觉语言模型通过将图像分割为局部图块实现了精细视觉理解,但面临计算负担过重的问题;HERO框架通过自适应令牌预算分配和功能感知令牌选择,在无需训练的情况下显著提升了效率与精度的平衡。
English: HR-LVLMs enhance fine-grained visual understanding by processing high-resolution images as local tiles but face computational inefficiency due to excessive visual tokens, leading to the development of HERO, a training-free framework that optimizes token selection for better efficiency and accuracy.

Authors:Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue
Title: HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models
Abstract:
By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
中文: 高分辨率大视觉语言模型通过将图像分割为局部图块实现了精细视觉理解,但面临计算负担过重的问题;HERO框架通过自适应令牌预算分配和功能感知令牌选择,在无需训练的情况下显著提升了效率与精度的平衡。
English: HR-LVLMs enhance fine-grained visual understanding by processing high-resolution images as local tiles but face computational inefficiency due to excessive visual tokens, leading to the development of HERO, a training-free framework that optimizes token selection for better efficiency and accuracy.

Authors:Jingdong Zhang, Weikai Chen, Yuan Liu, Jionghao Wang, Zhengming Yu, Zhuowen Shen, Bo Yang, Wenping Wang, Xin Li
Title: SPGen: Spherical Projection as Consistent and Flexible Representation for Single Image 3D Shape Generation
Abstract:
Existing single-view 3D generative models typically adopt multiview diffusion priors to reconstruct object surfaces, yet they remain prone to inter-view inconsistencies and are unable to faithfully represent complex internal structure or nontrivial topologies. In particular, we encode geometry information by projecting it onto a bounding sphere and unwrapping it into a compact and structural multi-layer 2D Spherical Projection (SP) representation. Operating solely in the image domain, SPGen offers three key advantages simultaneously: (1) Consistency. The injective SP mapping encodes surface geometry with a single viewpoint which naturally eliminates view inconsistency and ambiguity; (2) Flexibility. Multi-layer SP maps represent nested internal structures and support direct lifting to watertight or open 3D surfaces; (3) Efficiency. The image-domain formulation allows the direct inheritance of powerful 2D diffusion priors and enables efficient finetuning with limited computational resources. Extensive experiments demonstrate that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.
中文: SPGen提出球面投影表示法,通过将几何信息编码为多层二维映射消除视角不一致性,既能灵活建模内部结构又高效利用二维扩散先验,在几何质量和计算效率上显著优于现有方法。
English: SPGen introduces a Spherical Projection representation that eliminates view inconsistency by encoding geometry into multi-layer 2D maps, enabling flexible internal structure modeling and efficient 2D diffusion prior utilization while outperforming existing methods in quality and efficiency.

Authors:Jiarui Chen, Yikeng Chen, Yingshuang Zou, Ye Huang, Peng Wang, Yuan Liu, Yujing Sun, Wenping Wang
Title: MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning
Abstract:
3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS$^{2}$, a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight, arbitrarily oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality. Project page: https://megs-2.github.io/
中文: MEGS²是一种新型内存优化框架,通过联合优化基元数量和每个基元的参数,在保持相当渲染质量的同时,将3D高斯泼溅的渲染内存使用量降低了40-50%。
English: MEGS² is a novel memory-efficient framework that reduces 3D Gaussian Splatting's rendering memory usage by optimizing both primitive count and parameters per primitive, achieving 40-50% VRAM reduction while maintaining comparable quality.

Authors:Chuanxiang Yang, Yuanfeng Zhou, Guangshun Wei, Siyu Ren, Yuan Liu, Junhui Hou, Wenping Wang
Title: T-MLP: Tailed Multi-Layer Perceptron for Level-of-Detail Signal Representation
Abstract:
Level-of-detail (LoD) representation is critical for efficiently modeling and transmitting various types of signals, such as images and 3D shapes. In this work, we present a novel neural architecture that supports LoD signal representation. Our architecture is based on an elaborate modification of the widely used Multi-Layer Perceptron (MLP), which inherently operates at a single scale and therefore lacks native support for LoD. Specifically, we introduce the Tailed Multi-Layer Perceptron (T-MLP) that extends the MLP by attaching multiple output branches, also called tails, to its hidden layers, enabling direct supervision at multiple depths. Our loss formulation and training strategy allow each hidden layer to effectively learn a target signal at a specific LoD, thus enabling multi-scale modeling. Extensive experimental results show that our T-MLP outperforms other neural LoD baselines across a variety of signal representation tasks.
中文: 本文提出了一种带尾部的多层感知机(T-MLP),通过在隐藏层附加输出分支实现多尺度细节层次信号表示,仅需单分辨率监督即可在多种信号建模任务中取得优越性能。
English: This paper introduces a Tailed Multi-Layer Perceptron (T-MLP) that enables multi-scale level-of-detail signal representation by attaching output branches to hidden layers, achieving superior performance across various signal modeling tasks with only single-resolution supervision.

Authors:Chuanxiang Yang, Yuanfeng Zhou, Guangshun Wei, Siyu Ren, Yuan Liu, Junhui Hou, Wenping Wang
Title: T-MLP: Tailed Multi-Layer Perceptron for Level-of-Detail Signal Representation
Abstract:
Level-of-detail (LoD) representation is critical for efficiently modeling and transmitting various types of signals, such as images and 3D shapes. In this work, we propose a novel network architecture that enables LoD signal representation. Our approach builds on a modified Multi-Layer Perceptron (MLP), which inherently operates at a single scale and thus lacks native LoD support. Specifically, we introduce the Tailed Multi-Layer Perceptron (T-MLP), which extends the MLP by attaching an output branch, also called tail, to each hidden layer. Each tail refines the residual between the current prediction and the ground-truth signal, so that the accumulated outputs across layers correspond to the target signals at different LoDs, enabling multi-scale modeling with supervision from only a single-resolution signal. Extensive experiments demonstrate that our T-MLP outperforms existing neural LoD baselines across diverse signal representation tasks.
中文: 本文提出了一种带尾部的多层感知机(T-MLP),通过在隐藏层附加输出分支实现多尺度细节层次信号表示,仅需单分辨率监督即可在多种信号建模任务中取得优越性能。
English: This paper introduces a Tailed Multi-Layer Perceptron (T-MLP) that enables multi-scale level-of-detail signal representation by attaching output branches to hidden layers, achieving superior performance across various signal modeling tasks with only single-resolution supervision.

Authors:Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus
Title: BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Abstract:
The rise of Large Language Models (LLMs) is reshaping multimodel models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model's ability to follow text instructions for controllable Text-to-Speech~(TTS). To address this, we propose a new paradigm inspired by ``operationalism'' that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a ``conductor'', understanding user instructions and generating a textual ``plan'' -- explicit vocal features (e.g., pitch, energy). A separate TTS model, the ``orchestra'', then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
Chinese: 该研究提出了BatonVoice框架,通过让大型语言模型理解用户指令并生成文本化的声音特征,再由独立的TTS模型合成语音,显著提升了语音合成的可控性和跨语言泛化能力。
English: The study introduces BatonVoice, a novel framework that enhances speech synthesis by having an LLM interpret user instructions to generate textual vocal features, which a separate TTS model then converts into speech, achieving superior controllability and cross-lingual generalization.

Authors:Piotr Komorowski, Elena Golimblevskaia, Reduan Achtibat, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Title: Attribution-Guided Decoding
Abstract:
The capacity of Large Language Models (LLMs) to follow complex instructions and generate factually accurate text is critical for their real-world application. However, standard decoding methods often fail to robustly satisfy these requirements, while existing control techniques frequently degrade general output quality. In this work, we introduce Attribution-Guided Decoding (AGD), an interpretability-based decoding strategy. Instead of directly manipulating model activations, AGD considers a set of high-probability output token candidates and selects the one that exhibits the highest attribution to a user-defined Region of Interest (ROI). This ROI can be flexibly defined over different parts of the model's input or internal components, allowing AGD to steer generation towards various desirable behaviors. We demonstrate AGD's efficacy across three challenging domains. For instruction following, we show that AGD significantly boosts adherence (e.g., improving the overall success rate on Llama 3.1 from 66.0% to 79.1%). For knowledge-intensive tasks, we show that guiding generation towards usage of internal knowledge components or contextual sources can reduce hallucinations and improve factual accuracy in both closed-book and open-book settings. Furthermore, we propose an adaptive, entropy-based variant of AGD that mitigates quality degradation and reduces computational overhead by applying guidance only when the model is uncertain. Our work presents a versatile, more interpretable, and effective method for enhancing the reliability of modern LLMs.
中文摘要:归因引导解码(AGD)是一种基于可解释性的新方法,通过选择对用户定义区域具有最高归因度的输出词元,显著提升大语言模型的指令遵循能力和事实准确性,在多个领域有效提升性能的同时保持输出质量。
English Summary: Attribution-Guided Decoding (AGD) is a novel interpretability-based method that enhances LLMs' instruction adherence and factual accuracy by selecting output tokens with the highest attribution to user-defined regions, significantly improving performance across multiple domains while maintaining output quality.

Authors:Shenghe Zheng, Chenyu Huang, Fangchen Yu, Junchi Yao, Jingqi Ye, Tao Chen, Yun Luo, Ning Ding, LEI BAI, Ganqu Cui, Peng Ye
Title: SCI-Verifier: Scientific Verifier with Thinking
Abstract:
As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
中文: 该摘要针对大语言模型科学答案验证的难题,提出了跨学科评估基准SCI-VerifyBench和推理增强验证模型SCI-Verifier,以提升科学领域的可靠性和适用性。
English: This abstract addresses the challenges in verifying scientific answers from large language models by introducing SCI-VerifyBench, a cross-disciplinary evaluation benchmark, and SCI-Verifier, a reasoning-enhanced verification model, to improve reliability and applicability in scientific domains.

Authors:Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen
Title: SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention
Abstract:
In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts: a small fraction of large weights with high rank and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (Sparse-Linear Attention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights into critical, marginal, and negligible categories, applying O(N^2) attention to critical weights, O(N) attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models achieve a 20x reduction in attention computation, resulting in significant acceleration without loss of generation quality. Experiments show that SLA reduces attention computation by 95% without degrading end-to-end generation quality, outperforming baseline methods. In addition, we implement an efficient GPU kernel for SLA, which yields a 13.7x speedup in attention computation and a 2.2x end-to-end speedup in video generation on Wan2.1-1.3B.
中文:提出的SLA方法通过融合稀疏与线性注意力机制,在保持生成质量的同时将DiT模型的计算量减少20倍。
English: The proposed SLA method accelerates DiT models by combining sparse and linear attention mechanisms, achieving a 20x reduction in computation while maintaining generation quality.

Authors:Yan Wen, Peng Ye, Lin Zhang, Baopu Li, Jiakang Yuan, Yaoxin Yang, Tao Chen
Title: Sequential Token Merging: Revisiting Hidden States
Abstract:
Vision Mambas (ViMs) achieve remarkable success with sub-quadratic complexity, but their efficiency remains constrained by quadratic token scaling with image resolution. While existing methods address token redundancy, they overlook ViMs' intrinsic Limited Directional Sequential Dependence (LDSD) - a critical information flow mechanism revealed in our analysis. We further identify Mamba's selective scan enables gradual information aggregation in hidden states. Based on these insights, we propose Sequential Token Merging (STM), featuring: 1) Bidirectional nearest neighbor merging to preserve sequential dependencies through symmetric spatial aggregation, and 2) Hidden states protection to stabilize the hidden states around the class token. STM strategically leverages Mamba's layer-wise loss convergence to convert temporal forgetfulness into stability. Experiments demonstrate STM's superiority: 1.0% accuracy drop for ViM-Ti at 20% token reduction, and only 1.4% degradation for ViM-S at 40% reduction. Our method achieves state-of-the-art efficiency with minimal complexity, while providing new insights into state-space model dynamics. Codes will be released soon.
中文: Vision Mambas 因二次令牌扩展而效率受限,我们提出的顺序令牌合并方法通过双向最近邻合并和隐藏状态保护,在减少令牌的同时保持序列依赖性和稳定性,以最小精度损失实现顶尖效率。
English: Vision Mambas face efficiency limitations due to quadratic token scaling, but our proposed Sequential Token Merging method reduces tokens while preserving sequential dependencies and hidden state stability, achieving minimal accuracy loss with state-of-the-art efficiency.

Authors:Yue Duan, Lei Qi, Yinghuan Shi, Yang Gao
Title: An Adaptor for Triggering Semi-Supervised Learning to Out-of-Box Serve Deep Image Clustering
Abstract:
Recently, some works integrate SSL techniques into deep clustering frameworks to enhance image clustering performance. However, they all need pretraining, clustering learning, or a trained clustering model as prerequisites, limiting the flexible and out-of-box application of SSL learners in the image clustering task. This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without any prerequisites. Specifically, we first randomly sample pseudo-labeled data from all unlabeled data, and set an instance-level classifier to learn them with semantically aligned instance-level labels. With the ability of instance-level classification, we track the class transitions of predictions on unlabeled data to extract high-level similarities of instance-level classes, which can be utilized to assign cluster-level labels to pseudo-labeled data. Finally, we use the pseudo-labeled data with assigned cluster-level labels to trigger a general SSL learner trained on the unlabeled data for image clustering. We show the superior performance of ASD across various benchmarks against the latest deep image clustering approaches and very slight accuracy gaps compared to SSL methods using ground-truth, e.g., only 1.33% on CIFAR-10. Moreover, ASD can also further boost the performance of existing SSL-embedded deep image clustering methods.
中文: 本研究提出ASD适配器,无需任何前提条件即可实现SSL学习器的冷启动深度图像聚类,在多个基准测试中表现优异,并能进一步提升现有方法的性能。
English: This work introduces ASD, an adaptor that enables the cold-start of SSL learners for deep image clustering without prerequisites, achieving superior performance across benchmarks and boosting existing methods.

Authors:Tongtong Feng, Xin Wang, Yu-Gang Jiang, Wenwu Zhu
Title: Embodied AI: From LLMs to World Models
Abstract:
Embodied Artificial Intelligence (AI) is an intelligent system paradigm for achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications and driving the evolution from cyberspace to physical systems. Recent breakthroughs in Large Language Models (LLMs) and World Models (WMs) have drawn significant attention for embodied AI. On the one hand, LLMs empower embodied AI via semantic reasoning and task decomposition, bringing high-level natural language instructions and low-level natural language actions into embodied cognition. On the other hand, WMs empower embodied AI by building internal representations and future predictions of the external world, facilitating physical law-compliant embodied interactions. As such, this paper comprehensively explores the literature in embodied AI from basics to advances, covering both LLM driven and WM driven works. In particular, we first present the history, key technologies, key components, and hardware systems of embodied AI, as well as discuss its development via looking from unimodal to multimodal angle. We then scrutinize the two burgeoning fields of embodied AI, i.e., embodied AI with LLMs/multimodal LLMs (MLLMs) and embodied AI with WMs, meticulously delineating their indispensable roles in end-to-end embodied cognition and physical laws-driven embodied interactions. Building upon the above advances, we further share our insights on the necessity of the joint MLLM-WM driven embodied AI architecture, shedding light on its profound significance in enabling complex tasks within physical worlds. In addition, we examine representative applications of embodied AI, demonstrating its wide applicability in real-world scenarios. Last but not least, we point out future research directions of embodied AI that deserve further investigation.
中文: 具身人工智能利用大语言模型进行语义推理和世界模型进行物理预测,以实现与现实世界交互的智能系统,本文综述了其基础、进展与应用,并提出了结合多模态大语言模型与世界模型的架构以处理复杂任务。
English: Embodied AI leverages Large Language Models for semantic reasoning and World Models for physical predictions to enable intelligent systems that interact with the real world, with this paper reviewing its fundamentals, advances, and applications while proposing a combined MLLM-WM architecture for complex tasks.

Authors:Alexander Panfilov, Evgenii Kortukov, Kristina Nikolić, Matthias Bethge, Sebastian Lapuschkin, Wojciech Samek, Ameya Prabhu, Maksym Andriushchenko, Jonas Geiping
Title: Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs
Abstract:
Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are crafted to be subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool all output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a honeypot against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using them as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.
中文摘要:前沿大语言模型会发展出策略性欺骗,即对恶意请求生成看似有害但实际无害的回应,这种行为能规避安全监测器,并凸显出当帮助性与无害性冲突时,模型对齐难以控制的挑战。
English Summary: Frontier large language models can develop strategic dishonesty by producing seemingly harmful but practically harmless responses to malicious requests, which evades safety monitors and highlights challenges in controlling model alignment when helpfulness conflicts with harmlessness.

Authors:Yuyang Ding, Xinyu Shi, Juntao Li, Xiaobo Liang, Zhaopeng Tu, Min Zhang
Title: SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning
Abstract:
Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning. However, developing PRMs is challenging due to the high cost and limited scalability of human-annotated data. Synthetic data from Monte Carlo (MC) estimation is a promising alternative but suffers from a high noise ratio, which can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study on the noise distribution in synthetic data from MC estimation, identifying that annotation models tend to both underestimate and overestimate step correctness due to limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings indicate that: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can effectively learn from this weak supervision, achieving a 39.2 F1 score improvement (from 19.9 to 59.1) in ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.
Chinese: SCAN框架通过自去噪蒙特卡洛估计高效合成高质量过程奖励模型训练数据,能以极低推理成本实现卓越性能,并有效学习含噪声标注,其效果甚至优于基于大规模人工标注数据训练的模型。
English: The SCAN framework leverages self-denoising Monte Carlo estimation to efficiently synthesize high-quality training data for process reward models, enabling superior performance with minimal inference cost and robust learning from noisy annotations, even surpassing models trained on extensive human-annotated datasets.

Authors:Zhaolin Wang, Jiaqi Xu, Chongjun Ouyang, Xidong Mu, Yuanwei Liu
Title: Multiport Network Modeling and Optimization for Reconfigurable Pinching-Antenna Systems
Abstract:
A reconfigurable pinching-antenna system (PASS) is presented, endowing pinching antennas (PAs) with both amplitude- and phase-controllable radiation beyond conventional implementations. To characterize this feature, a general and physically consistent model is established for PASS via multiport network theory. Within this model, the fundamental constraint of ideal reconfigurability of PAs is identified, allowing the full control of signal amplitudes and phases. A practical directional-coupler (DC)-based PA model is then proposed, enabling both amplitude-only control and amplitude-constrained phase control. Beamforming optimization is investigated for both ideal and practical cases: an optimal solution is obtained for ideal PAs, whereas a high-quality iterative algorithm is developed for DC-based PAs. Numerical results suggest that in single-user scenarios: (i) with optimized PA positions, performance gains arise primarily from amplitude reconfigurability and DC-based PAs approach ideal performance, and (ii) with fixed PA positions, both amplitude and phase reconfigurability are critical and DC-based PAs incur non-negligible loss.
中文摘要:本文提出了一种可重构夹持天线系统,实现了辐射幅度和相位的全面控制,并通过理想模型和定向耦合器模型分别研究了单用户场景下的波束成形优化问题。
English Summary: A reconfigurable pinching-antenna system is introduced that enables full amplitude and phase control of radiation, with ideal and practical directional-coupler-based models developed for beamforming optimization in single-user scenarios.

Authors:Shu Liu, Soujanya Ponnapalli, Shreya Shankar, Sepanta Zeighami, Alan Zhu, Shubham Agarwal, Ruiqi Chen, Samion Suwito, Shuo Yuan, Ion Stoica, Matei Zaharia, Alvin Cheung, Natacha Crooks, Joseph E. Gonzalez, Aditya G. Parameswaran
Title: Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First
Abstract:
Large Language Model (LLM) agents, acting on their users' behalf to manipulate and analyze data, are likely to become the dominant workload for data systems in the future. When working with data, agents employ a high-throughput process of exploration and solution formulation for the given task, one we call agentic speculation. The sheer volume and inefficiencies of agentic speculation can pose challenges for present-day data systems. We argue that data systems need to adapt to more natively support agentic workloads. We take advantage of the characteristics of agentic speculation that we identify, i.e., scale, heterogeneity, redundancy, and steerability - to outline a number of new research opportunities for a new agent-first data systems architecture, ranging from new query interfaces, to new query processing techniques, to new agentic memory stores.
中文: 大型语言模型代理将成为未来数据系统的主要负载,其大规模、异构、冗余和可引导的推测特性要求数据系统进行原生适配,这既带来挑战也开辟了新的研究方向。
English: Large Language Model agents are poised to become the primary workload for data systems, necessitating adaptations to support their speculative processes characterized by scale, heterogeneity, redundancy, and steerability, which present both challenges and research opportunities.

Authors:Tongtong Feng, Xin Wang, Feilin Han, Leping Zhang, Wenwu Zhu
Title: U2UData-2: A Scalable Swarm UAVs Autonomous Flight Dataset for Long-horizon Tasks
Abstract:
Swarm UAV autonomous flight for Long-Horizon (LH) tasks is crucial for advancing the low-altitude economy. However, existing methods focus only on specific basic tasks due to dataset limitations, failing in real-world deployment for LH tasks. LH tasks are not mere concatenations of basic tasks, requiring handling long-term dependencies, maintaining persistent states, and adapting to dynamic goal shifts. This paper presents U2UData-2, the first large-scale swarm UAV autonomous flight dataset for LH tasks and the first scalable swarm UAV data online collection and algorithm closed-loop verification platform. The dataset is captured by 15 UAVs in autonomous collaborative flights for LH tasks, comprising 12 scenes, 720 traces, 120 hours, 600 seconds per trajectory, 4.32M LiDAR frames, and 12.96M RGB frames. This dataset also includes brightness, temperature, humidity, smoke, and airflow values covering all flight routes. The platform supports the customization of simulators, UAVs, sensors, flight algorithms, formation modes, and LH tasks. Through a visual control window, this platform allows users to collect customized datasets through one-click deployment online and to verify algorithms by closed-loop simulation. U2UData-2 also introduces an LH task for wildlife conservation and provides comprehensive benchmarks with 9 SOTA models. U2UData-2 can be found at https://fengtt42.github.io/U2UData-2/.
中文: 本文提出U2UData-2,首个面向长周期任务的无人机群自主飞行大规模数据集及平台,通过支持定制化数据采集和闭环算法验证,解决了现有方法在真实场景部署中的局限性。
English: This paper introduces U2UData-2, the first large-scale dataset and platform for swarm UAV autonomous flight in long-horizon tasks, addressing limitations of existing methods by enabling customized data collection and closed-loop algorithm verification across diverse scenarios.

Authors:Joanna Sorysz, Lars Krupp, Dominique Nshimyimana, Meagan B. Loerakker, Bo Zhou, Paul Lukowicz, Jakob Karolus
Title: Beyond the Pocket: A Large-Scale International Study on User Preferences on Bodily Placements of Commercial Wearables
Abstract:
As wearable technologies continue to evolve-becoming smaller, more powerful, and more deeply embedded in daily life-their integration into diverse user contexts raises critical design challenges. There remains a notable gap in large-scale empirical data on where users actually wear or carry these devices throughout the day, systematically examining user preferences for wearable placement across varied contexts and routines. In this work, we conducted a questionnaire in several countries aimed at capturing real-world habits related to wearable device placement. The results from n = 320 participants reveal how wearable usage patterns shift depending on time of day and context. We propose a set of practical, user-centered guidelines for sensor placement and discuss how they align or diverge from assumptions seen in existing ISWC work. This study contributes to ongoing efforts within the community to design more inclusive, adaptable, and context-aware wearable systems.
中文: 本研究通过多国问卷调查填补了可穿戴设备实际佩戴位置的大规模数据空白,揭示了320名参与者的使用模式如何随时间和情境变化,并提出了以用户为中心的指导原则,以推动更具适应性的可穿戴系统设计。
English: This study addresses the lack of large-scale data on wearable device placement by surveying 320 participants across multiple countries, revealing how usage patterns vary with time and context, and proposes user-centered guidelines to inform more adaptable wearable system designs.

Authors:Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang
Title: Knowledge Homophily in Large Language Models
Abstract:
Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.
中文摘要:本研究通过识别大语言模型中的知识同质性模式,开发图神经网络预测实体知识掌握程度,从而在同等标注成本下提升知识覆盖效率,并优化模型微调和问答任务中的多跳推理能力。
English Summary: This study explores the structural organization of knowledge in Large Language Models by identifying a knowledge homophily pattern and developing a Graph Neural Network model to predict entity-level knowledgeability, which enhances knowledge coverage and improves efficiency in fine-tuning and question answering tasks.

Authors:Huacan Chai, Zijie Cao, Maolin Ran, Yingxuan Yang, Jianghao Lin, Xin Peng, Hairui Wang, Renjie Ding, Ziyu Wan, Muning Wen, Weiwen Liu, Weinan Zhang, Fei Huang, Ying Wen
Title: PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness
Abstract:
Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.
中文摘要:PARL-MT框架通过进度感知生成管道和强化学习算法,将任务进度意识显式融入大语言模型的多轮函数调用训练,在公开基准测试中显著优于现有方法。
English Summary: PARL-MT is a novel framework that enhances multi-turn function calling in large language models by explicitly integrating progress awareness through automated dataset generation and guided reinforcement learning, achieving superior performance on benchmarks.

Authors:Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, Xunliang Cai
Title: Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
Abstract:
Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model's reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
中文: 近期大推理模型的进展依赖于强化学习和思维链数据,本文首次定义了以解题所需尝试次数倒数衡量的推理潜力,并提出筛选高价值推理模式的方法,仅用少量数据即可显著提升模型性能。
English: Recent advances in large reasoning models leverage reinforcement learning and chain-of-thought data, and this paper introduces reasoning potential measured by inverse attempts needed to solve problems, proposing a method to select high-value reasoning patterns that significantly boost model performance with minimal data.

Authors:Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Shuo Wang, Hongfei Yan, Jingang Wang, Xunliang Cai
Title: Expanding Reasoning Potential in Foundation Model by Learning Diverse Chains of Thought Patterns
Abstract:
Recent progress in large reasoning models for challenging mathematical reasoning has been driven by reinforcement learning (RL). Incorporating long chain-of-thought (CoT) data during mid-training has also been shown to substantially improve reasoning depth. However, current approaches often utilize CoT data indiscriminately, leaving open the critical question of which data types most effectively enhance model reasoning capabilities. In this paper, we define the foundation model's reasoning potential for the first time as the inverse of the number of independent attempts required to correctly answer the question, which is strongly correlated with the final model performance. We then propose utilizing diverse data enriched with high-value reasoning patterns to expand the reasoning potential. Specifically, we abstract atomic reasoning patterns from CoT sequences, characterized by commonality and inductive capabilities, and use them to construct a core reference set enriched with valuable reasoning patterns. Furthermore, we propose a dual-granularity algorithm involving chains of reasoning patterns and token entropy, efficiently selecting high-value CoT data (CoTP) from the data pool that aligns with the core set, thereby training models to master reasoning effectively. Only 10B-token CoTP data enables the 85A6B Mixture-of-Experts (MoE) model to improve by 9.58% on the challenging AIME 2024 and 2025, and to raise the upper bound of downstream RL performance by 7.81%.
中文: 近期大推理模型的进展依赖于强化学习和思维链数据,本文首次定义了以解题所需尝试次数倒数衡量的推理潜力,并提出筛选高价值推理模式的方法,仅用少量数据即可显著提升模型性能。
English: Recent advances in large reasoning models leverage reinforcement learning and chain-of-thought data, and this paper introduces reasoning potential measured by inverse attempts needed to solve problems, proposing a method to select high-value reasoning patterns that significantly boost model performance with minimal data.

Authors:Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chao Zhang, Chengcheng Han, Chenhui Yang, Chi Zhang, Chong Peng, Chuyu Zhang, Cong Chen, Fengcun Li, Gang Xu, Guoyuan Lin, Hao Jiang, Hao Liang, Haomin Fu, Haoxiang Ma, Hong Liu, Hongyan Hao, Hongyin Tang, Hongyu Zang, Hongzhi Ni, Hui Su, Jiahao Liu, Jiahuan Li, Jialin Liu, Jianfei Zhang, Jianhao Xu, Jianing Wang, Jiaqi Sun, Jiaqi Zhang, Jiarong Shi, Jiawei Yang, Jingang Wang, Jinrui Ding, Jun Kuang, Jun Xu, Ke He, Kefeng Zhang, Keheng Wang, Keqing He, Li Wei, Liang Shi, Lin Qiu, Lingbin Kong, Lingchuan Liu, Linsen Guo, Longfei An, Mai Xia, Meng Zhou, Mengshen Zhu, Peng Pei, Pengcheng Jia, Qi Gu, Qi Guo, Qiong Huang, Quan Chen, Quanchi Weng, Rongxiang Weng, Ruichen Shao, Rumei Li, Shanglin Lei, Shuai Du, Shuaikang Liu, Shuang Zhou, Shuhao Hu, Siyu Xu, Songshan Gong, Tao Liang, Tianhao Hu, Wei He, Wei Shi, Wei Wang, Wei Wu, Wei Zhuo, Weifeng Tang, Wenjie Shi, Wenlong Zhu, Xi Su, Xiangcheng Liu, Xiangyu Xi, Xiangzhou Huang, Xiao Liu, Xiaochen Jiang, Xiaowei Shi, Xiaowen Shi, Xiaoyu Li, Xin Chen, Xinyue Zhao, Xuan Huang, Xuemiao Zhang, Xuezhi Cao, Xunliang Cai, Yajie Zhang, Yang Chen, Yang Liu, Yang Liu, Yang Zheng, Yaoming Wang, Yaqi Huo, Yerui Sun, Yifan Lu, Yiyang Li, Youshao Xiao, Yuanzhe Lei, Yuchen Xie, Yueqing Sun, Yufei Zhang, Yuhuai Wei, Yulei Qian, Yunke Zhao, Yuqing Ding, Yuwei Jiang, Zhaohua Yang, Zhengyu Chen, Zhijian Liu, Zhikang Xia, Zhongda Su, Ziran Li, Ziwen Wang, Ziyuan Zhuang, Zongyu Wang, Zunyuan Yang
Title: LongCat-Flash-Thinking Technical Report
Abstract:
We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19, 653 to 6, 965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
中文: LongCat-Flash-Thinking是一个高效的5600亿参数开源推理模型,通过精心设计的冷启动思维链数据和大规模强化学习训练,在复杂推理任务中达到顶尖性能,并在智能体推理中显著降低64.5%的令牌消耗。
English: LongCat-Flash-Thinking is a highly efficient 560-billion-parameter open-source reasoning model that achieves state-of-the-art performance through a unique training process combining cold-start Chain-of-Thought data and large-scale reinforcement learning, while significantly reducing token consumption in agentic reasoning.

Authors:Marcin Copik, Eiman Alnuaimi, Alok Kamatar, Valerie Hayot-Sasson, Alberto Madonna, Todd Gamblin, Kyle Chard, Ian Foster, Torsten Hoefler
Title: XaaS Containers: Performance-Portable Representation With Source and IR Containers
Abstract:
High-performance computing (HPC) systems and cloud data centers are converging, and containers are becoming the default method of portable software deployment. Yet, while containers simplify software management, they face significant performance challenges in HPC environments as they must sacrifice hardware-specific optimizations to achieve portability. Although HPC containers can use runtime hooks to access optimized MPI libraries and GPU devices, they are limited by application binary interface (ABI) compatibility and cannot overcome the effects of early-stage compilation decisions. Acceleration as a Service (XaaS) proposes a vision of performance-portable containers, where a containerized application should achieve peak performance across all HPC systems. We present a practical realization of this vision through Source and Intermediate Representation (IR) containers, where we delay performance-critical decisions until the target system specification is known. We analyze specialization mechanisms in HPC software and propose a new LLM-assisted method for automatic discovery of specializations. By examining the compilation pipeline, we develop a methodology to build containers optimized for target architectures at deployment time. Our prototype demonstrates that new XaaS containers combine the convenience of containerization with the performance benefits of system-specialized builds.
中文摘要:高性能计算与云系统正采用容器实现便携部署,但面临性能折衷;提出的源码与中间表示容器通过延迟优化决策至部署阶段,成功兼顾了便携性与峰值性能。
English Summary: High-performance computing and cloud systems are adopting containers for portability, but face performance trade-offs, which the proposed Source and IR containers resolve by delaying optimization decisions until deployment to achieve both portability and peak performance.

Authors:Yuanjie Lyu, Chengyu Wang, Jun Huang, Tong Xu
Title: From Correction to Mastery: Reinforced Distillation of Large Language Model Agents
Abstract:
Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student often lead to compounding errors. We propose SCoRe, a student-centered framework in which the student generates trajectories and the teacher intervenes only at the first critical error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix before the first critical error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and improves training stability. Particularly, on 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.
中文: SCoRe是一种以学生为中心的蒸馏框架,通过让学生生成训练轨迹并由教师纠正早期错误,结合微调与短视距强化学习,使70亿参数的学生模型在复杂任务上达到720亿参数教师模型的性能水平。
English: SCoRe is a student-centered distillation framework that enhances smaller language models by having them generate training trajectories for teacher correction and combining fine-tuning with short-horizon reinforcement learning, enabling a 7B-parameter student to match the performance of a 72B-parameter teacher on complex tasks.

Authors:Yuanjie Lyu, Chengyu Wang, Jun Huang, Tong Xu
Title: From Correction to Mastery: Reinforced Distillation of Large Language Model Agents
Abstract:
Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student can cause compounding errors. We propose SCoRe, a student-centered framework in which the student generates training trajectories and the teacher corrects only the earliest error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix preceding the earliest error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and enhances training stability. On 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.
中文: SCoRe是一种以学生为中心的蒸馏框架,通过让学生生成训练轨迹并由教师纠正早期错误,结合微调与短视距强化学习,使70亿参数的学生模型在复杂任务上达到720亿参数教师模型的性能水平。
English: SCoRe is a student-centered distillation framework that enhances smaller language models by having them generate training trajectories for teacher correction and combining fine-tuning with short-horizon reinforcement learning, enabling a 7B-parameter student to match the performance of a 72B-parameter teacher on complex tasks.

Authors:Rui-Chen Zheng, Wenrui Liu, Hui-Peng Du, Qinglin Zhang, Chong Deng, Qian Chen, Wen Wang, Yang Ai, Zhen-Hua Ling
Title: Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding
Abstract:
Existing speech tokenizers typically assign a fixed number of tokens per second, regardless of the varying information density or temporal fluctuations in the speech signal. This uniform token allocation mismatches the intrinsic structure of speech, where information is distributed unevenly over time. To address this, we propose VARSTok, a VAriable-frame-Rate Speech Tokenizer that adapts token allocation based on local feature similarity. VARSTok introduces two key innovations: (1) a temporal-aware density peak clustering algorithm that adaptively segments speech into variable-length units, and (2) a novel implicit duration coding scheme that embeds both content and temporal span into a single token index, eliminating the need for auxiliary duration predictors. Extensive experiments show that VARSTok significantly outperforms strong fixed-rate baselines. Notably, it achieves superior reconstruction naturalness while using up to 23% fewer tokens than a 40 Hz fixed-frame-rate baseline. VARSTok further yields lower word error rates and improved naturalness in zero-shot text-to-speech synthesis. To the best of our knowledge, this is the first work to demonstrate that a fully dynamic, variable-frame-rate acoustic speech tokenizer can be seamlessly integrated into downstream speech language models. Speech samples are available at https://zhengrachel.github.io/VARSTok.
中文: VARSTok是一种创新的可变帧率语音分词器,通过局部特征相似性自适应分配语音单元,在比固定帧率方法减少高达23%令牌用量的同时,实现了更优越的语音重建与合成质量。
English: VARSTok is a novel variable-frame-rate speech tokenizer that adaptively allocates tokens based on local feature similarity, achieving superior speech reconstruction and synthesis quality while reducing token usage by up to 23% compared to fixed-rate methods.

Authors:Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
Title: OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Abstract:
We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
中文: OneCAT 是一种统一的多模态模型,采用纯解码器架构,集理解、生成和编辑于一体,无需外部组件即可实现高效能和领先性能。
English: OneCAT is a unified multimodal model that integrates understanding, generation, and editing in a pure decoder-only transformer, achieving efficiency and state-of-the-art performance without external components.

Authors:Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong
Title: OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation
Abstract:
We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components such as Vision Transformers (ViT) or vision tokenizer during inference, leading to significant efficiency gains, especially for high-resolution inputs. This is achieved through a modality-specific Mixture-of-Experts (MoE) structure trained with a single autoregressive (AR) objective, which also natively supports dynamic resolutions. Furthermore, we pioneer a multi-scale visual autoregressive mechanism within the Large Language Model (LLM) that drastically reduces decoding steps compared to diffusion-based methods while maintaining state-of-the-art performance. Our findings demonstrate the powerful potential of pure autoregressive modeling as a sufficient and elegant foundation for unified multimodal intelligence. As a result, OneCAT sets a new performance standard, outperforming existing open-source unified multimodal models across benchmarks for multimodal generation, editing, and understanding.
中文: OneCAT 是一种统一的多模态模型,采用纯解码器架构,集理解、生成和编辑于一体,无需外部组件即可实现高效能和领先性能。
English: OneCAT is a unified multimodal model that integrates understanding, generation, and editing in a pure decoder-only transformer, achieving efficiency and state-of-the-art performance without external components.

Authors:Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
Title: DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search
Abstract:
Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
中文: DeepSearch将蒙特卡洛树搜索融入RLVR训练,通过系统性探索突破性能瓶颈,以大幅降低的计算成本实现了最先进的推理性能。
English: DeepSearch integrates Monte Carlo Tree Search into RLVR training to overcome performance plateaus by enabling systematic exploration during training, achieving state-of-the-art results with significantly reduced computational costs.

Authors:Dawei Li, Zhen Tan, Chengshuai Zhao, Bohan Jiang, Baixiang Huang, Pingchuan Ma, Abdullah Alnaibari, Kai Shu, Huan Liu
Title: Who's Your Judge? On the Detectability of LLM-Generated Judgments
Abstract:
Large Language Model (LLM)-based judgments leverage powerful LLMs to efficiently evaluate candidate content and provide judgment scores. However, the inherent biases and vulnerabilities of LLM-generated judgments raise concerns, underscoring the urgent need for distinguishing them in sensitive scenarios like academic peer reviewing. In this work, we propose and formalize the task of judgment detection and systematically investigate the detectability of LLM-generated judgments. Unlike LLM-generated text detection, judgment detection relies solely on judgment scores and candidates, reflecting real-world scenarios where textual feedback is often unavailable in the detection process. Our preliminary analysis shows that existing LLM-generated text detection methods perform poorly given their incapability to capture the interaction between judgment scores and candidate content -- an aspect crucial for effective judgment detection. Inspired by this, we introduce \textit{J-Detector}, a lightweight and transparent neural detector augmented with explicitly extracted linguistic and LLM-enhanced features to link LLM judges' biases with candidates' properties for accurate detection. Experiments across diverse datasets demonstrate the effectiveness of \textit{J-Detector} and show how its interpretability enables quantifying biases in LLM judges. Finally, we analyze key factors affecting the detectability of LLM-generated judgments and validate the practical utility of judgment detection in real-world scenarios.
中文: 本研究提出J-Detector这一轻量级神经检测器,通过分析评分与内容间的关联有效识别大语言模型生成的判断,解决了学术评审等关键场景中的偏见问题。
English: This study introduces J-Detector, a lightweight neural tool that effectively identifies LLM-generated judgments by analyzing the relationship between scores and content, addressing biases in critical applications like academic reviews.

Authors:NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu
Title: Pretraining Large Language Models with NVFP4
Abstract:
Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens -- the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
中文摘要:大型语言模型(LLMs)虽通过扩展规模不断进步,但其训练需消耗巨大资源,因此提升效率至关重要;本研究提出了一种基于NVFP4的创新方法,实现了稳定的4位精度训练,在120亿参数模型上使用10万亿标记训练后,性能与FP8基准相当。
English Summary: Large Language Models (LLMs) are advancing through scaling, but their training demands immense resources, making efficiency crucial; this study introduces a novel NVFP4-based method that enables stable 4-bit precision training, achieving performance comparable to FP8 in a 12-billion-parameter model trained on 10 trillion tokens.

Authors:Ryosuke Takanami, Petr Khrapchenkov, Shu Morikuni, Jumpei Arima, Yuta Takaba, Shunsuke Maeda, Takuya Okubo, Genki Sano, Satoshi Sekioka, Aoi Kadoya, Motonari Kambara, Naoya Nishiura, Haruto Suzuki, Takanori Yoshimoto, Koya Sakamoto, Shinnosuke Ono, Hu Yang, Daichi Yashima, Aoi Horo, Tomohiro Motoda, Kensuke Chiyoma, Hiroshi Ito, Koki Fukuda, Akihito Goto, Kazumi Morinaga, Yuya Ikeda, Riko Kawada, Masaki Yoshikawa, Norio Kosuge, Yuki Noguchi, Kei Ota, Tatsuya Matsushima, Yusuke Iwasawa, Yutaka Matsuo, Tetsuya Ogata
Title: AIRoA MoMa Dataset: A Large-Scale Hierarchical Dataset for Mobile Manipulation
Abstract:
As robots transition from controlled settings to unstructured human environments, building generalist agents that can reliably follow natural language instructions remains a central challenge. Progress in robust mobile manipulation requires large-scale multimodal datasets that capture contact-rich and long-horizon tasks, yet existing resources lack synchronized force-torque sensing, hierarchical annotations, and explicit failure cases. We address this gap with the AIRoA MoMa Dataset, a large-scale real-world multimodal dataset for mobile manipulation. It includes synchronized RGB images, joint states, six-axis wrist force-torque signals, and internal robot states, together with a novel two-layer annotation schema of sub-goals and primitive actions for hierarchical learning and error analysis. The initial dataset comprises 25,469 episodes (approx. 94 hours) collected with the Human Support Robot (HSR) and is fully standardized in the LeRobot v2.1 format. By uniquely integrating mobile manipulation, contact-rich interaction, and long-horizon structure, AIRoA MoMa provides a critical benchmark for advancing the next generation of Vision-Language-Action models. The first version of our dataset is now available at https://huggingface.co/datasets/airoa-org/airoa-moma .
中文: AIRoA MoMa数据集是一个大规模多模态移动操作资源,集成了同步传感器数据和分层标注,填补了接触密集型任务学习的空白,为推进视觉-语言-动作模型提供了关键基准。
English: The AIRoA MoMa Dataset is a large-scale multimodal resource for mobile manipulation, featuring synchronized sensor data and hierarchical annotations to address gaps in contact-rich task learning and advance Vision-Language-Action models.

Authors:Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng
Title: FrameThinker: Learning to Think with Long Videos via Multi-Turn Frame Spotlighting
Abstract:
While Large Vision-Language Models (LVLMs) have achieved substantial progress in video understanding, their application to long video reasoning is hindered by uniform frame sampling and static textual reasoning, which are inefficient and struggle to handle visually intensive video tasks. To overcome these challenges, in this paper, we introduce the concept of thinking with long videos and propose a novel framework FrameThinker. Within this framework, LVLMs are able to iteratively interrogate video content. Developing such video reasoning capabilities in LVLMs presents notable challenges, particularly in adapting the model to new video actions (e.g. select frame), and designing reward functions to guide LVLMs to adopt the newly introduced action. To solve these challenges, we propose a two-phase training strategy, first employing Supervised Fine-Tuning (SFT) to instill fundamental action capabilities, followed by Reinforcement Learning (RL) to optimize a strategic decision-making policy. Notably, in this RL phase, we conduct an in-depth and comprehensive exploration of the reward design for each action and format reward. Extensive experiments on reasoning benchmarks like Video-Holmes, LongVideo-Reason, and long-video understanding benchmarks such as LongVideoBench, MLVU, VideoMME, and LVBench, demonstrate that FrameThinker achieves a significant average improvement of +10.4% over baselines while drastically reducing the number of processed frames. Most notably, our 7B model, FrameThinker establishes a new state-of-the-art on LongVideo-Reason, achieving 76.1% accuracy using an average of only 20.6 frames. This not only outperforms the competitive LongVILA-R1 (72.0%) but does so with over 20x fewer frames (vs. 512), demonstrating unparalleled efficiency and effectiveness.
中文: FrameThinker提出了一种新颖框架,通过两阶段训练策略使大型视觉语言模型能够迭代式分析长视频,以极少的帧处理量实现了最先进的性能表现。
English: FrameThinker introduces a novel framework enabling Large Vision-Language Models to iteratively interrogate long videos through a two-phase training strategy, achieving state-of-the-art performance with significantly reduced frame processing.

Authors:Ran Xu, Kaixin Ma, Wenhao Yu, Hongming Zhang, Joyce C. Ho, Carl Yang, Dong Yu
Title: Retrieval-augmented GUI Agents with Generative Guidelines
Abstract:
GUI agents powered by vision-language models (VLMs) show promise in automating complex digital tasks. However, their effectiveness in real-world applications is often limited by scarce training data and the inherent complexity of these tasks, which frequently require long-tailed knowledge covering rare, unseen scenarios. We propose RAG-GUI , a lightweight VLM that leverages web tutorials at inference time. RAG-GUI is first warm-started via supervised finetuning (SFT) and further refined through self-guided rejection sampling finetuning (RSF). Designed to be model-agnostic, RAG-GUI functions as a generic plug-in that enhances any VLM-based agent. Evaluated across three distinct tasks, it consistently outperforms baseline agents and surpasses other inference baselines by 2.6% to 13.3% across two model sizes, demonstrating strong generalization and practical plug-and-play capabilities in real-world scenarios.
中文: RAG-GUI是一种轻量级视觉语言模型,通过利用网络教程和两阶段训练方法增强GUI代理,在实际任务中展现出卓越性能和即插即用的适应性。
English: RAG-GUI is a lightweight vision-language model that enhances GUI agents by leveraging web tutorials and a two-stage training process, achieving superior performance and plug-and-play adaptability in real-world tasks.

Authors:Xiangfei Qiu, Liu Yang, Hanyin Cheng, Xingjian Wu, Rongjia Wu, Zhigang Zhang, Ding Tu, Chenjuan Guo, Bin Yang, Christian S. Jensen, Jilin Hu
Title: Multi-Scale Spatial-Temporal Hypergraph Network with Lead-Lag Structures for Stock Time Series Forecasting
Abstract:
Time series forecasting occurs in a range of financial applications providing essential decision-making support to investors, regulatory institutions, and analysts. Unlike multivariate time series from other domains, stock time series exhibit industry correlation. Exploiting this kind of correlation can improve forecasting accuracy. However, existing methods based on hypergraphs can only capture industry correlation relatively superficially. These methods face two key limitations: they do not fully consider inter-industry lead-lag interactions, and they do not model multi-scale information within and among industries. This study proposes the Hermes framework for stock time series forecasting that aims to improve the exploitation of industry correlation by eliminating these limitations. The framework integrates moving aggregation and multi-scale fusion modules in a hypergraph network. Specifically, to more flexibly capture the lead-lag relationships among industries, Hermes proposes a hyperedge-based moving aggregation module. This module incorporates a sliding window and utilizes dynamic temporal aggregation operations to consider lead-lag dependencies among industries. Additionally, to effectively model multi-scale information, Hermes employs cross-scale, edge-to-edge message passing to integrate information from different scales while maintaining the consistency of each scale. Experimental results on multiple real-world stock datasets show that Hermes outperforms existing state-of-the-art methods in both efficiency and accuracy.
中文: Hermes框架通过引入基于超边的移动聚合和多尺度融合模块,更有效地捕捉行业间的领先-滞后关系和多尺度信息,在多个真实股票数据集上实现了效率和精度的显著提升。
English: The Hermes framework enhances stock time series forecasting by addressing limitations in capturing industry correlations through a hypergraph network with moving aggregation and multi-scale fusion modules, leading to superior efficiency and accuracy in experiments.

Authors:Xvyuan Liu, Xiangfei Qiu, Hanyin Cheng, Xingjian Wu, Chenjuan Guo, Bin Yang, Jilin Hu
Title: ASTGI: Adaptive Spatio-Temporal Graph Interactions for Irregular Multivariate Time Series Forecasting
Abstract:
Irregular multivariate time series (IMTS) are prevalent in critical domains like healthcare and finance, where accurate forecasting is vital for proactive decision-making. However, the asynchronous sampling and irregular intervals inherent to IMTS pose two core challenges for existing methods: (1) how to accurately represent the raw information of irregular time series without introducing data distortion, and (2) how to effectively capture the complex dynamic dependencies between observation points. To address these challenges, we propose the Adaptive Spatio-Temporal Graph Interaction (ASTGI) framework. Specifically, the framework first employs a Spatio-Temporal Point Representation module to encode each discrete observation as a point within a learnable spatio-temporal embedding space. Second, a Neighborhood-Adaptive Graph Construction module adaptively builds a causal graph for each point in the embedding space via nearest neighbor search. Subsequently, a Spatio-Temporal Dynamic Propagation module iteratively updates information on these adaptive causal graphs by generating messages and computing interaction weights based on the relative spatio-temporal positions between points. Finally, a Query Point-based Prediction module generates the final forecast by aggregating neighborhood information for a new query point and performing regression. Extensive experiments on multiple benchmark datasets demonstrate that ASTGI outperforms various state-of-the-art methods.
中文:ASTGI框架通过将离散观测编码至可学习的时空嵌入空间,并基于自适应因果图动态传播信息,有效解决了不规则多元时间序列预测中的核心难题,在多个基准测试中展现出卓越性能。
English: The ASTGI framework effectively addresses the challenges of irregular multivariate time series forecasting by encoding observations in a learnable spatio-temporal space and adaptively propagating information through dynamic causal graphs, demonstrating superior performance across multiple benchmarks.

Authors:Kohei Sendai, Maxime Alvarez, Tatsuya Matsushima, Yutaka Matsuo, Yusuke Iwasawa
Title: Leave No Observation Behind: Real-time Correction for VLA Action Chunks
Abstract:
To improve efficiency and temporal coherence, Vision-Language-Action (VLA) models often predict action chunks; however, this action chunking harms reactivity under inference delay and long horizons. We introduce Asynchronous Action Chunk Correction (A2C2), which is a lightweight real-time chunk correction head that runs every control step and adds a time-aware correction to any off-the-shelf VLA's action chunk. The module combines the latest observation, the predicted action from VLA (base action), a positional feature that encodes the index of the base action within the chunk, and some features from the base policy, then outputs a per-step correction. This preserves the base model's competence while restoring closed-loop responsiveness. The approach requires no retraining of the base policy and is orthogonal to asynchronous execution schemes such as Real Time Chunking (RTC). On the dynamic Kinetix task suite (12 tasks) and LIBERO Spatial, our method yields consistent success rate improvements across increasing delays and execution horizons (+23% point and +7% point respectively, compared to RTC), and also improves robustness for long horizons even with zero injected delay. Since the correction head is small and fast, there is minimal overhead compared to the inference of large VLA models. These results indicate that A2C2 is an effective, plug-in mechanism for deploying high-capacity chunking policies in real-time control.
中文: 异步动作块校正(A2C2)模块通过实时修正动作块来增强视觉-语言-动作模型,无需重新训练基础策略或增加显著开销,即可提高响应能力和成功率。
English: The Asynchronous Action Chunk Correction (A2C2) module enhances Vision-Language-Action models by adding real-time corrections to action chunks, improving reactivity and success rates without retraining the base policy or adding significant overhead.

Authors:Zhen-Hao Wen, Yan Wang, Ji Feng, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou
Title: Hierarchical Representation Matching for CLIP-based Class-Incremental Learning
Abstract:
Class-Incremental Learning (CIL) aims to endow models with the ability to continuously adapt to evolving data streams. Recent advances in pre-trained vision-language models (e.g., CLIP) provide a powerful foundation for this task. However, existing approaches often rely on simplistic templates, such as "a photo of a [CLASS]", which overlook the hierarchical nature of visual concepts. For example, recognizing "cat" versus "car" depends on coarse-grained cues, while distinguishing "cat" from "lion" requires fine-grained details. Similarly, the current feature mapping in CLIP relies solely on the representation from the last layer, neglecting the hierarchical information contained in earlier layers. In this work, we introduce HiErarchical Representation MAtchiNg (HERMAN) for CLIP-based CIL. Our approach leverages LLMs to recursively generate discriminative textual descriptors, thereby augmenting the semantic space with explicit hierarchical cues. These descriptors are matched to different levels of the semantic hierarchy and adaptively routed based on task-specific requirements, enabling precise discrimination while alleviating catastrophic forgetting in incremental tasks. Extensive experiments on multiple benchmarks demonstrate that our method consistently achieves state-of-the-art performance.
Chinese: HERMAN方法通过利用大语言模型生成层次化文本描述符,并将其与视觉特征自适应匹配,从而提升了基于CLIP的类增量学习的判别能力并缓解了灾难性遗忘问题。
English: HERMAN enhances CLIP-based Class-Incremental Learning by using LLMs to generate hierarchical textual descriptors, which are adaptively matched to visual features to improve discrimination and reduce forgetting.

Authors:Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo
Title: Aurora: Towards Universal Generative Multimodal Time Series Forecasting
Abstract:
Cross-domain generalization is very important in Time Series Forecasting because similar historical information may lead to distinct future trends due to the domain-specific characteristics. Recent works focus on building unimodal time series foundation models and end-to-end multimodal supervised models. Since domain-specific knowledge is often contained in modalities like texts, the former lacks the explicit utilization of them, thus hindering the performance. The latter is tailored for end-to-end scenarios and does not support zero-shot inference for cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on Corss-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corrsponding text or image modalities, thus possessing strong Cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora can extract multimodal domain knowledge as guidance and then utilizes a Modality-Guided Multi-head Self-Attention to inject them into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on well-recognized benchmarks, including TimeMMD, TSFM-Bench and ProbTS, demonstrate the consistent state-of-the-art performance of Aurora on both unimodal and multimodal scenarios.
中文摘要:Aurora是一种多模态时间序列基础模型,通过自适应提取文本和图像中的领域知识并结合原型引导的概率预测,在单模态和多模态场景中均实现了零样本推理的跨域泛化最优性能。
English Summary: Aurora is a multimodal time series foundation model that leverages cross-domain text and image inputs through adaptive knowledge extraction and prototype-guided forecasting, achieving state-of-the-art performance in both unimodal and multimodal scenarios with zero-shot inference capability.

Authors:Xingjian Wu, Zhengyu Li, Hanyin Cheng, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang
Title: Unlocking the Power of Mixture-of-Experts for Task-Aware Time Series Analytics
Abstract:
Time Series Analysis is widely used in various real-world applications such as weather forecasting, financial fraud detection, imputation for missing data in IoT systems, and classification for action recognization. Mixture-of-Experts (MoE), as a powerful architecture, though demonstrating effectiveness in NLP, still falls short in adapting to versatile tasks in time series analytics due to its task-agnostic router and the lack of capability in modeling channel correlations. In this study, we propose a novel, general MoE-based time series framework called PatchMoE to support the intricate ``knowledge'' utilization for distinct tasks, thus task-aware. Based on the observation that hierarchical representations often vary across tasks, e.g., forecasting vs. classification, we propose a Recurrent Noisy Gating to utilize the hierarchical information in routing, thus obtaining task-sepcific capability. And the routing strategy is operated on time series tokens in both temporal and channel dimensions, and encouraged by a meticulously designed Temporal \& Channel Load Balancing Loss to model the intricate temporal and channel correlations. Comprehensive experiments on five downstream tasks demonstrate the state-of-the-art performance of PatchMoE.
中文摘要:本研究提出PatchMoE这一面向任务的混合专家框架,通过循环噪声门控和多维路由策略提升时序分析能力,在五项下游任务中实现了最先进的性能表现。
English Summary: This study introduces PatchMoE, a task-aware Mixture-of-Experts framework that enhances time series analysis through recurrent noisy gating and multi-dimensional routing, achieving state-of-the-art performance across five downstream tasks.

Authors:Zeyu Wang, Baiyu Chen, Kun Yan, Hongjing Piao, Hao Xue, Flora D. Salim, Yuanchun Shi, Yuntao Wang
Title: Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
Abstract:
With the rise in popularity of smart glasses, users' attention has been integrated into Vision-Language Models (VLMs) to streamline multi-modal querying in daily scenarios. However, leveraging gaze data to model users' attention may introduce ambiguity challenges: (1) users' verbal questions become ambiguous by using pronouns or skipping context, (2) humans' gaze patterns can be noisy and exhibit complex spatiotemporal relationships with their spoken questions. Previous works only consider single image as visual modality input, failing to capture the dynamic nature of the user's attention. In this work, we introduce GLARIFY, a novel method to leverage spatiotemporal gaze information to enhance the model's effectiveness in real-world applications. Initially, we analyzed hundreds of querying samples with the gaze modality to demonstrate the noisy nature of users' gaze patterns. We then utilized GPT-4o to design an automatic data synthesis pipeline to generate the GLARIFY-Ambi dataset, which includes a dedicated chain-of-thought (CoT) process to handle noisy gaze patterns. Finally, we designed a heatmap module to incorporate gaze information into cutting-edge VLMs while preserving their pretrained knowledge. We evaluated GLARIFY using a hold-out test set. Experiments demonstrate that GLARIFY significantly outperforms baselines. By robustly aligning VLMs with human attention, GLARIFY paves the way for a usable and intuitive interaction paradigm with a visual assistant.
Chinese: GLARIFY是一种创新方法,利用时空注视数据解决用户查询的模糊性和注视模式的噪声问题,显著提升了视觉语言模型在实际应用中的性能表现。
English: GLARIFY is a novel method that utilizes spatiotemporal gaze data to address ambiguity in user queries and noisy gaze patterns, significantly enhancing Vision-Language Models' performance in real-world applications.

Authors:Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Title: RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Abstract:
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
中文摘要:本研究提出新框架分析强化学习与监督微调如何影响大语言模型的推理路径,发现强化学习将推理功能集中于少数步骤,而监督微调将其分散至更多步骤,从而解释了两者顺序组合的有效性。
English Summary: This study introduces a novel framework analyzing how reinforcement learning and supervised fine-tuning shape reasoning paths in large language models, revealing that RL concentrates reasoning into fewer steps while SFT distributes it more broadly, explaining why their sequential combination proves effective.

Authors:Keno Harada, Yudai Yamazaki, Masachika Taniguchi, Edison Marrese-Taylor, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Title: When Instructions Multiply: Measuring and Estimating LLM Capabilities of Multiple Instructions Following
Abstract:
As large language models (LLMs) are increasingly applied to real-world scenarios, it becomes crucial to understand their ability to follow multiple instructions simultaneously. To systematically evaluate these capabilities, we introduce two specialized benchmarks for fundamental domains where multiple instructions following is important: Many Instruction-Following Eval (ManyIFEval) for text generation with up to ten instructions, and Style-aware Mostly Basic Programming Problems (StyleMBPP) for code generation with up to six instructions. Our experiments with the created benchmarks across ten LLMs reveal that performance consistently degrades as the number of instructions increases. Furthermore, given the fact that evaluating all the possible combinations of multiple instructions is computationally impractical in actual use cases, we developed three types of regression models that can estimate performance on both unseen instruction combinations and different numbers of instructions which are not used during training. We demonstrate that a logistic regression model using instruction count as an explanatory variable can predict performance of following multiple instructions with approximately 10% error, even for unseen instruction combinations. We show that relatively modest sample sizes (500 for ManyIFEval and 300 for StyleMBPP) are sufficient for performance estimation, enabling efficient evaluation of LLMs under various instruction combinations.
中文: 本研究引入两个基准评估大语言模型遵循多重指令的能力,发现随着指令数量增加性能持续下降,并证明逻辑回归模型能以约10%的误差预测结果,且仅需适量样本即可实现高效评估。
English: This study introduces two benchmarks to evaluate large language models' ability to follow multiple instructions, revealing performance declines with increasing instruction counts and demonstrating that logistic regression can predict these outcomes with about 10% error using manageable sample sizes.

Authors:Wei Gao, Yuheng Zhao, Dakai An, Tianyuan Wu, Lunxi Cao, Shaopan Xiong, Ju Huang, Weixun Wang, Siran Yang, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang
Title: RollPacker: Mitigating Long-Tail Rollouts for Fast, Synchronous RL Post-Training
Abstract:
Reinforcement Learning (RL) is a pivotal post-training technique for enhancing the reasoning capabilities of Large Language Models (LLMs). However, synchronous RL post-training often suffers from significant GPU underutilization, referred to as bubbles, caused by imbalanced response lengths within rollout steps. Many RL systems attempt to alleviate this problem by relaxing synchronization, but this can compromise training accuracy. In this paper, we introduce tail batching, a novel rollout scheduling strategy for synchronous RL that systematically consolidates prompts leading to long-tail responses into a small subset of rollout steps (long rounds), while ensuring that the majority of steps (short rounds) involve only balanced, short rollouts. By excluding long responses from short rounds and rescheduling them into a few designated long rounds, tail batching effectively reduces GPU idle time during rollouts and significantly accelerates RL training without sacrificing accuracy. We present RollPacker, a system that fully harnesses the benefits of tail batching through holistic optimizations across all three RL stages: elastic parallelism adaptation for rollout, dynamic resource allocation and scheduling for reward, and stream-based training. Empirical results show that RollPacker achieves a 2.03x-2.56x end-to-end training time reduction compared to veRL and up to 2.24x speedup compared to RLHFuse for the Qwen2.5 family of LLMs on up to 128 H800 GPUs.
中文: 尾批处理是一种创新的同步强化学习调度策略,通过将长尾响应集中到特定步骤中,在保证准确性的同时显著减少GPU空闲时间,实现训练速度提升2.03-2.56倍。
English: Tail batching is a novel synchronous reinforcement learning scheduling strategy that groups long-tail responses into dedicated steps, significantly reducing GPU idle time and accelerating training by 2.03x-2.56x without compromising accuracy.

Authors:Bum Jun Kim, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo
Title: Unlocking Noise-Resistant Vision: Key Architectural Secrets for Robust Models
Abstract:
While the robustness of vision models is often measured, their dependence on specific architectural design choices is rarely dissected. We investigate why certain vision architectures are inherently more robust to additive Gaussian noise and convert these empirical insights into simple, actionable design rules. Specifically, we performed extensive evaluations on 1,174 pretrained vision models, empirically identifying four consistent design patterns for improved robustness against Gaussian noise: larger stem kernels, smaller input resolutions, average pooling, and supervised vision transformers (ViTs) rather than CLIP ViTs, which yield up to 506 rank improvements and 21.6\%p accuracy gains. We then develop a theoretical analysis that explains these findings, converting observed correlations into causal mechanisms. First, we prove that low-pass stem kernels attenuate noise with a gain that decreases quadratically with kernel size and that anti-aliased downsampling reduces noise energy roughly in proportion to the square of the downsampling factor. Second, we demonstrate that average pooling is unbiased and suppresses noise in proportion to the pooling window area, whereas max pooling incurs a positive bias that grows slowly with window size and yields a relatively higher mean-squared error and greater worst-case sensitivity. Third, we reveal and explain the vulnerability of CLIP ViTs via a pixel-space Lipschitz bound: The smaller normalization standard deviations used in CLIP preprocessing amplify worst-case sensitivity by up to 1.91 times relative to the Inception-style preprocessing common in supervised ViTs. Our results collectively disentangle robustness into interpretable modules, provide a theory that explains the observed trends, and build practical, plug-and-play guidelines for designing vision models more robust against Gaussian noise.
中文摘要:本研究揭示了提升视觉模型对高斯噪声鲁棒性的关键架构设计选择,包括更大的主干核、更小的输入分辨率、平均池化和监督式ViT,并通过理论分析解释了这些机制。
English Summary: This study identifies key architectural design choices that enhance vision models' robustness to Gaussian noise, including larger stem kernels, smaller input resolutions, average pooling, and supervised ViTs, supported by theoretical analysis explaining these mechanisms.

Authors:Kangtao Lv, Haibin Chen, Yujin Yuan, Langming Liu, Shilei Liu, Yongwei Wang, Wenbo Su, Bo Zheng
Title: How to inject knowledge efficiently? Knowledge Infusion Scaling Law for Pre-training Large Language Models
Abstract:
Large language models (LLMs) have attracted significant attention due to their impressive general capabilities across diverse downstream tasks. However, without domain-specific optimization, they often underperform on specialized knowledge benchmarks and even produce hallucination. Recent studies show that strategically infusing domain knowledge during pretraining can substantially improve downstream performance. A critical challenge lies in balancing this infusion trade-off: injecting too little domain-specific data yields insufficient specialization, whereas excessive infusion triggers catastrophic forgetting of previously acquired knowledge. In this work, we focus on the phenomenon of memory collapse induced by over-infusion. Through systematic experiments, we make two key observations, i.e. 1) Critical collapse point: each model exhibits a threshold beyond which its knowledge retention capabilities sharply degrade. 2) Scale correlation: these collapse points scale consistently with the model's size. Building on these insights, we propose a knowledge infusion scaling law that predicts the optimal amount of domain knowledge to inject into large LLMs by analyzing their smaller counterparts. Extensive experiments across different model sizes and pertaining token budgets validate both the effectiveness and generalizability of our scaling law.
中文: 大型语言模型面临领域专业化与知识保留的平衡难题,本研究通过系统实验发现了过度注入知识时的临界崩溃点,并提出基于小模型分析的缩放定律来优化领域知识注入量。
English: Large language models face a trade-off between domain specialization and knowledge retention, with this study identifying a critical collapse point during over-infusion and proposing a scaling law to optimize domain knowledge injection by analyzing smaller models.

Authors:Xiangfei Qiu, Yuhan Zhu, Zhengyu Li, Hanyin Cheng, Xingjian Wu, Chenjuan Guo, Bin Yang, Jilin Hu
Title: DAG: A Dual Causal Network for Time Series Forecasting with Exogenous Variables
Abstract:
Time series forecasting is crucial in various fields such as economics, traffic, and AIOps. However, in real-world applications, focusing solely on the endogenous variables (i.e., target variables), is often insufficient to ensure accurate predictions. Considering exogenous variables (i.e., covariates) provides additional predictive information, thereby improving forecasting accuracy. However, existing methods for time series forecasting with exogenous variables (TSF-X) have the following shortcomings: 1) they do not leverage future exogenous variables, 2) they fail to account for the causal relationships between endogenous and exogenous variables. As a result, their performance is suboptimal. In this study, to better leverage exogenous variables, especially future exogenous variable, we propose a general framework DAG, which utilizes dual causal network along both the temporal and channel dimensions for time series forecasting with exogenous variables. Specifically, we first introduce the Temporal Causal Module, which includes a causal discovery module to capture how historical exogenous variables affect future exogenous variables. Following this, we construct a causal injection module that incorporates the discovered causal relationships into the process of forecasting future endogenous variables based on historical endogenous variables. Next, we propose the Channel Causal Module, which follows a similar design principle. It features a causal discovery module models how historical exogenous variables influence historical endogenous variables, and a causal injection module incorporates the discovered relationships to enhance the prediction of future endogenous variables based on future exogenous variables.
中文摘要:本研究提出的DAG框架通过沿时间和通道维度构建双重因果网络,有效利用外生变量(包括未来变量)与内生变量间的因果关系,显著提升了时序预测性能。
English Summary: The proposed DAG framework enhances time series forecasting by modeling dual causal relationships between endogenous and exogenous variables across temporal and channel dimensions to overcome limitations of existing methods.

Authors:Yiyuan Yang, Guodong Long, Qinghua Lu, Liming Zhu, Jing Jiang
Title: Bi-level Personalization for Federated Foundation Models: A Task-vector Aggregation Approach
Abstract:
Federated foundation models represent a new paradigm to jointly fine-tune pre-trained foundation models across clients. It is still a challenge to fine-tune foundation models for a small group of new users or specialized scenarios, which typically involve limited data compared to the large-scale data used in pre-training. In this context, the trade-off between personalization and federation becomes more sensitive. To tackle these, we proposed a bi-level personalization framework for federated fine-tuning on foundation models. Specifically, we conduct personalized fine-tuning on the client-level using its private data, and then conduct a personalized aggregation on the server-level using similar users measured by client-specific task vectors. Given the personalization information gained from client-level fine-tuning, the server-level personalized aggregation can gain group-wise personalization information while mitigating the disturbance of irrelevant or interest-conflict clients with non-IID data. The effectiveness of the proposed algorithm has been demonstrated by extensive experimental analysis in benchmark datasets.
中文: 本文提出了一种双层个性化联邦微调框架,通过客户端个性化训练和基于用户相似性的服务器聚合,有效解决了数据有限和非独立同分布场景下的个性化与联邦学习平衡问题。
English: This paper introduces a bi-level personalization framework for federated fine-tuning of foundation models, enabling client-level personalized training and server-level aggregation based on user similarity to address data limitations and non-IID challenges.

Authors:Saketh Vishnubhatla, Ujun Jeong, Bohan Jiang, Paras Sheth, Zhen Tan, Adrienne Raglin, Huan Liu
Title: Assessing On-the-Ground Disaster Impact Using Online Data Sources
Abstract:
Assessing the impact of a disaster in terms of asset losses and human casualties is essential for preparing effective response plans. Traditional methods include offline assessments conducted on the ground, where volunteers and first responders work together to collect the estimate of losses through windshield surveys or on-ground inspection. However, these methods have a time delay and are prone to different biases. Recently, various online data sources, including social media, news reports, aerial imagery, and satellite data, have been utilized to evaluate the impact of disasters. Online data sources provide real-time data streams for estimating the offline impact. Limited research exists on how different online sources help estimate disaster impact at a given administrative unit. In our work, we curate a comprehensive dataset by collecting data from multiple online sources for a few billion-dollar disasters at the county level. We also analyze how online estimates compare with traditional offline-based impact estimates for the disaster. Our findings provide insight into how different sources can provide complementary information to assess the disaster.
中文: 传统灾害评估方法缓慢且存在偏差,而在线数据源能提供实时信息,我们的研究通过对比揭示了二者在灾害评估中的互补作用。
English: Traditional disaster impact assessments are slow and biased, while online data sources offer real-time insights, and our research compares both to reveal their complementary roles in evaluating disasters.

Authors:Chengbing Wang, Yang Zhang, Zhicheng Wang, Tianhao Shi, Keqin Bao, Fuli Feng, Tat-Seng Chua
Title: Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation
Abstract:
Fine-tuning large language models (LLMs) for recommendation in a generative manner has delivered promising results, but encounters significant inference overhead due to autoregressive decoding in the language space. This work explores bypassing language-space decoding by directly matching candidate items with the LLM's internal thought representations in the latent space, eliminating the time-consuming autoregressive process to reduce computational costs. Towards this, we introduce Light Latent-space Decoding (L2D), an effective and efficient latent-space decoding method. L2D represents user-preferred items by using the hidden states of test sequences reflecting the LLM's internal thought, and obtains candidate item representations from the hidden states of training sequences labeled with the corresponding candidate items. It then matches the two types of representations to decode items, achieving latent-space decoding. In this way, it enables efficient decoding without altering the LLM's generative tuning paradigm, thereby preserving performance. Extensive empirical results demonstrate that L2D is more than 10x faster than language-space decoding while maintaining or enhancing performance.
中文摘要:本文提出轻量潜在空间解码(L2D)方法,通过在潜在空间中匹配候选项目与大语言模型的内部思维表征,避免了语言空间的自回归解码过程,在保持性能的同时实现10倍以上的加速效果。
English Summary: This paper introduces Light Latent-space Decoding (L2D), an efficient method that matches candidate items with LLM's internal representations in latent space to eliminate autoregressive decoding, achieving over 10x speedup while maintaining performance.

Authors:Yusuke Hirota, Ryo Hachiuma, Boyi Li, Ximing Lu, Michael Ross Boone, Boris Ivanovic, Yejin Choi, Marco Pavone, Yu-Chiang Frank Wang, Noa Garcia, Yuta Nakashima, Chao-Han Huck Yang
Title: Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation
Abstract:
Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
中文: 研究发现,当前视觉语言模型的性别偏见评估因与非性别特征(如物体和背景)的虚假相关性而严重失真,即使对这些特征进行微小扰动也会显著改变偏见得分,从而影响评估的可靠性。
English: This study reveals that current gender bias evaluations in vision-language models are heavily distorted by spurious correlations with non-gender features, as even minor perturbations to objects or backgrounds can drastically alter bias scores, undermining their reliability.

Authors:Yusuke Hirota, Ryo Hachiuma, Boyi Li, Ximing Lu, Michael Ross Boone, Boris Ivanovic, Yejin Choi, Marco Pavone, Yu-Chiang Frank Wang, Noa Garcia, Yuta Nakashima, Chao-Han Huck Yang
Title: Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation
Abstract:
Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
中文: 研究发现,当前视觉语言模型的性别偏见评估因与非性别特征(如物体和背景)的虚假相关性而严重失真,即使对这些特征进行微小扰动也会显著改变偏见得分,从而影响评估的可靠性。
English: This study reveals that current gender bias evaluations in vision-language models are heavily distorted by spurious correlations with non-gender features, as even minor perturbations to objects or backgrounds can drastically alter bias scores, undermining their reliability.

Authors:Ao Chang, Yubo Chen, Jun Zhao
Title: PL-CA: A Parametric Legal Case Augmentation Framework
Abstract:
Conventional RAG is considered one of the most effective methods for addressing model knowledge insufficiency and hallucination, particularly in the judicial domain that requires high levels of knowledge rigor, logical consistency, and content integrity. However, the conventional RAG method only injects retrieved documents directly into the model's context, which severely constrains models due to their limited context windows and introduces additional computational overhead through excessively long contexts, thereby disrupting models' attention and degrading performance on downstream tasks. Moreover, many existing benchmarks lack expert annotation and focus solely on individual downstream tasks while real-world legal scenarios consist of multiple mixed legal tasks, indicating conventional benchmarks' inadequacy for reflecting models' true capabilities. To address these limitations, we propose PL-CA, which introduces a parametric RAG (P-RAG) framework to perform data augmentation on corpus knowledge and encode this legal knowledge into parametric vectors, and then integrates this parametric knowledge into the LLM's feed-forward networks (FFN) via LoRA, thereby alleviating models' context pressure. Additionally, we also construct a multi-task legal dataset comprising more than 2000 training and test instances, which are all expert-annotated and manually verified. We conduct our experiments on our dataset, and the experimental results demonstrate that our method reduces the overhead associated with excessively long contexts while maintaining competitive performance on downstream tasks compared to conventional RAG. Our code and dataset are provided in the appendix.
中文: 提出的PL-CA方法引入参数化RAG框架,将法律知识编码为向量并通过LoRA集成,在保持下游任务竞争力的同时减轻上下文负担,并构建了专家标注的多任务法律数据集进行评估。
English: The proposed PL-CA method introduces a parametric RAG framework that encodes legal knowledge into vectors integrated via LoRA, reducing context overhead while maintaining competitive performance, and establishes an expert-annotated multi-task legal dataset for evaluation.

Authors:Hao-Nan Shi, Ting-Ji Huang, Lu Han, De-Chuan Zhan, Han-Jia Ye
Title: One-Embedding-Fits-All: Efficient Zero-Shot Time Series Forecasting by a Model Zoo
Abstract:
The proliferation of Time Series Foundation Models (TSFMs) has significantly advanced zero-shot forecasting, enabling predictions for unseen time series without task-specific fine-tuning. Extensive research has confirmed that no single TSFM excels universally, as different models exhibit preferences for distinct temporal patterns. This diversity suggests an opportunity: how to take advantage of the complementary abilities of TSFMs. To this end, we propose ZooCast, which characterizes each model's distinct forecasting strengths. ZooCast can intelligently assemble current TSFMs into a model zoo that dynamically selects optimal models for different forecasting tasks. Our key innovation lies in the One-Embedding-Fits-All paradigm that constructs a unified representation space where each model in the zoo is represented by a single embedding, enabling efficient similarity matching for all tasks. Experiments demonstrate ZooCast's strong performance on the GIFT-Eval zero-shot forecasting benchmark while maintaining the efficiency of a single TSFM. In real-world scenarios with sequential model releases, the framework seamlessly adds new models for progressive accuracy gains with negligible overhead.
中文: ZooCast提出了一种模型库,通过将每个时序基础模型表示为单一嵌入,动态选择最适合不同预测任务的模型,在零样本预测中表现出色且高效。
English: ZooCast introduces a model zoo that dynamically selects optimal Time Series Foundation Models for various forecasting tasks by representing each model with a single embedding, achieving strong zero-shot performance and efficiency.

Authors:Zhilin Wang, Zhe Yang, Yun Luo, Yafu Li, Xiaoye Qu, Ziqian Qiao, Haoran Zhang, Runzhe Zhan, Derek F. Wong, Jizhe Zhou, Yu Cheng
Title: Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning
Abstract:
Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. Inspired by mathematics, where simple operations yield infinite verifiable problems, we introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions to systematically synthesize a vast and diverse corpus of sheet music reasoning problems. This approach allows us to introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music, while also pointing out the ongoing challenges in understanding sheet music in a visual format. By leveraging synthetic data for RLVR, all models show significant improvements on the SSMR-Bench. Additionally, they also demonstrate considerable advancements on previously established human-crafted benchmarks, such as MusicTheoryBench and the music subset of MMMU. Finally, our results show that the enhanced reasoning ability can also facilitate music composition.
中文: 本研究提出一种基于音乐理论合成可验证乐谱问题的框架,有效提升了AI模型在乐谱理解和音乐创作方面的推理能力。
English: This study introduces a synthetic data framework for generating verifiable sheet music problems to enhance AI models' reasoning in music interpretation, demonstrating improved performance in benchmarks and music composition.

Authors:Ziye Jia, Jia He, Yuanhao Cui, Qiuming Zhu, Ligang Yuan, Fuhui Zhou, Qihui Wu, Dusit Niyato, Zhu Han
Title: Hierarchical Low-Altitude Wireless Network Empowered Air Traffic Management
Abstract:
As the increasing development of low-altitude aircrafts, the rational design of low-altitude networks directly impacts the aerial safety and resource utilization. To address the challenges of environmental complexity and aircraft diversity in the traffic management, we propose a hierarchical low-altitude wireless network (HLWN) framework. Empowered by the threedimensional spatial discretization and integrated wireless monitoring mechanisms in HLWN, we design low-altitude air corridors to guarantee safe operation and optimization. Besides, we develop the multi-dimensional flight risk assessment through conflict detection and probabilistic collision analysis, facilitating dynamic collision avoidance for heterogeneous aircrafts. Finally, the open issues and future directions are investigated to provide insights into HLAN development.
中文摘要:本文提出的分层低空无线网络框架通过三维空间离散化和综合无线监测机制,构建低空航路保障运行安全,并利用多维飞行风险评估实现异构航空器的动态防撞。
English Summary: The proposed hierarchical low-altitude wireless network framework addresses air traffic safety through spatial discretization, wireless monitoring, and multi-dimensional risk assessment to enable dynamic collision avoidance for diverse aircraft.

Authors:Yixian Zhang, Shu'ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, Wenbo Ding
Title: SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
Abstract:
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
Chinese: 该研究通过揭示基于流的策略训练与循环计算的等价性,提出了稳定的架构Flow-G和Flow-T,并开发了一种实用的SAC算法,无需常用变通方法即可在连续控制和机器人操作任务中取得最优性能。
English: The study addresses the instability in training flow-based policies by identifying their equivalence to recurrent computations and introducing stable architectures, Flow-G and Flow-T, along with a practical SAC-based algorithm that achieves top performance without needing common workarounds.

Authors:Shilong Ji, Yinuo Chen, Chuqi Wang, Jiayu Chen, Ruize Zhang, Feng Gao, Wenhao Tang, Shu'ang Yu, Sirui Xiang, Xinlei Chen, Chao Yu, Yu Wang
Title: JuggleRL: Mastering Ball Juggling with a Quadrotor via Deep Reinforcement Learning
Abstract:
Aerial robots interacting with objects must perform precise, contact-rich maneuvers under uncertainty. In this paper, we study the problem of aerial ball juggling using a quadrotor equipped with a racket, a task that demands accurate timing, stable control, and continuous adaptation. We propose JuggleRL, the first reinforcement learning-based system for aerial juggling. It learns closed-loop policies in large-scale simulation using systematic calibration of quadrotor and ball dynamics to reduce the sim-to-real gap. The training incorporates reward shaping to encourage racket-centered hits and sustained juggling, as well as domain randomization over ball position and coefficient of restitution to enhance robustness and transferability. The learned policy outputs mid-level commands executed by a low-level controller and is deployed zero-shot on real hardware, where an enhanced perception module with a lightweight communication protocol reduces delays in high-frequency state estimation and ensures real-time control. Experiments show that JuggleRL achieves an average of $311$ hits over $10$ consecutive trials in the real world, with a maximum of $462$ hits observed, far exceeding a model-based baseline that reaches at most $14$ hits with an average of $3.1$. Moreover, the policy generalizes to unseen conditions, successfully juggling a lighter $5$ g ball with an average of $145.9$ hits. This work demonstrates that reinforcement learning can empower aerial robots with robust and stable control in dynamic interaction tasks.
中文: 本文提出JuggleRL强化学习系统,通过模拟训练闭环策略并零样本部署到真实四旋翼飞行器上,实现了空中颠球任务,其表现远超基于模型的方法,并能泛化至未训练场景,展现了强化学习在动态交互任务中的强大控制能力。
English: This paper introduces JuggleRL, a reinforcement learning system that enables a quadrotor with a racket to perform aerial ball juggling by learning closed-loop policies in simulation and deploying them zero-shot on real hardware, achieving significantly more hits than model-based methods and demonstrating robust generalization to unseen conditions.

Authors:Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu
Title: Latent Collective Preference Optimization: A General Framework for Robust LLM Alignment
Abstract:
Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone technology for aligning Large Language Models (LLMs) with human values. However, these methods are all underpinned by a critical, yet flawed assumption: human preferences are homogeneous (representing a single, unified preference) and the collected data is noiseless (free from error). In reality, neither is true since human preference is pluralistic and annotators can make mistakes. This creates a discrepancy between the recorded data and the ground-truth preferences, which can misguide the model and degrade its performance. To address this challenge, we introduce Latent Collective Preference Optimization (LCPO). LCPO leverages an Expectation-Maximization (EM) algorithm to learn the latent collective consensus from noisy data. It operates by inferring the correctness of each preference label and using this probability as an adaptive weight to re-calibrate each data point's contribution to the training loss, thereby mitigating noise. We generalize this approach by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models, elevating LCPO from a specific algorithm to a general framework for robust preference alignment. Theoretically, we prove that under the condition of a perfectly calibrated model, LCPO is guaranteed to converge to the true noise level of the dataset. Our experiments demonstrate LCPO's effectiveness as a general framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). When applied to Mistral and Llama 3 models, the LCPO-enhanced methods achieve substantial win rate gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% on both benchmarks.
Chinese: 传统偏好对齐方法(如RLHF)错误假设人类偏好统一无噪声,而LCPO框架通过从噪声数据中学习潜在群体共识并自适应调整训练权重,有效解决了这一问题,在多个基准测试中显著提升了模型性能。
English: Standard preference alignment methods like RLHF incorrectly assume uniform and error-free human preferences, but the proposed LCPO framework effectively addresses this by learning latent consensus from noisy data and adaptively recalibrating training contributions, significantly boosting model performance in benchmarks.

Authors:Lukas Breitwieser, Ahmad Hesam, Abdullah Giray Yağlıkçı, Mohammad Sadrosadati, Fons Rademakers, Onur Mutlu
Title: TeraAgent: A Distributed Agent-Based Simulation Engine for Simulating Half a Trillion Agents
Abstract:
Agent-based simulation is an indispensable paradigm for studying complex systems. These systems can comprise billions of agents, requiring the computing resources of multiple servers to simulate. Unfortunately, the state-of-the-art platform, BioDynaMo, does not scale out across servers due to its shared-memory-based implementation. To overcome this key limitation, we introduce TeraAgent, a distributed agent-based simulation engine. A critical challenge in distributed execution is the exchange of agent information across servers, which we identify as a major performance bottleneck. We propose two solutions: 1) a tailored serialization mechanism that allows agents to be accessed and mutated directly from the receive buffer, and 2) leveraging the iterative nature of agent-based simulations to reduce data transfer with delta encoding. Built on our solutions, TeraAgent enables extreme-scale simulations with half a trillion agents (an 84x improvement), reduces time-to-result with additional compute nodes, improves interoperability with third-party tools, and provides users with more hardware flexibility.
中文: TeraAgent是一种分布式基于代理的仿真引擎,通过优化的序列化和增量编码技术克服了BioDynaMo的可扩展性限制,实现了支持五千亿代理的极大规模仿真并提升了性能。
English: TeraAgent is a distributed agent-based simulation engine that overcomes the scalability limitations of BioDynaMo through optimized serialization and delta encoding, enabling extreme-scale simulations with half a trillion agents and improved performance.

Authors:Ziqi Wang, Boye Niu, Zhongli Li, Linghui Meng, Jing Liu, Zhi Zheng, Tong Xu, Hua Wu, Haifeng Wang, Enhong Chen
Title: A2R: An Asymmetric Two-Stage Reasoning Framework for Parallel Reasoning
Abstract:
Recent Large Reasoning Models have achieved significant improvements in complex task-solving capabilities by allocating more computation at the inference stage with a "thinking longer" paradigm. Even as the foundational reasoning capabilities of models advance rapidly, the persistent gap between a model's performance in a single attempt and its latent potential, often revealed only across multiple solution paths, starkly highlights the disparity between its realized and inherent capabilities. To address this, we present A2R, an Asymmetric Two-Stage Reasoning framework designed to explicitly bridge the gap between a model's potential and its actual performance. In this framework, an "explorer" model first generates potential solutions in parallel through repeated sampling. Subsequently,a "synthesizer" model integrates these references for a more refined, second stage of reasoning. This two-stage process allows computation to be scaled orthogonally to existing sequential methods. Our work makes two key innovations: First, we present A2R as a plug-and-play parallel reasoning framework that explicitly enhances a model's capabilities on complex questions. For example, using our framework, the Qwen3-8B-distill model achieves a 75% performance improvement compared to its self-consistency baseline. Second, through a systematic analysis of the explorer and synthesizer roles, we identify an effective asymmetric scaling paradigm. This insight leads to A2R-Efficient, a "small-to-big" variant that combines a Qwen3-4B explorer with a Qwen3-8B synthesizer. This configuration surpasses the average performance of a monolithic Qwen3-32B model at a nearly 30% lower cost. Collectively, these results show that A2R is not only a performance-boosting framework but also an efficient and practical solution for real-world applications.
中文: A2R框架采用非对称双阶段推理,通过探索器并行生成方案与合成器二次精炼,显著提升模型性能与效率。
English: The A2R framework introduces an asymmetric two-stage reasoning process where an explorer generates multiple solutions and a synthesizer refines them, significantly boosting model performance and efficiency.

Authors:Zikun Guo, Xinyue Xu, Pei Xiang, Shu Yang, Xin Han, Di Wang, Lijie Hu
Title: Benchmarking and Mitigate Psychological Sycophancy in Medical Vision-Language Models
Abstract:
Vision language models(VLMs) are increasingly integrated into clinical workflows, but they often exhibit sycophantic behavior prioritizing alignment with user phrasing social cues or perceived authority over evidence based reasoning. This study evaluate clinical sycophancy in medical visual question answering through a novel clinically grounded benchmark. We propose a medical sycophancy dataset construct from PathVQA, SLAKE, and VQA-RAD stratified by different type organ system and modality. Using psychologically motivated pressure templates including various sycophancy. In our adversarial experiments on various VLMs, we found that these models are generally vulnerable, exhibiting significant variations in the occurrence of adversarial responses, with weak correlations to the model accuracy or size. Imitation and expert provided corrections were found to be the most effective triggers, suggesting that the models possess a bias mechanism independent of visual evidence. To address this, we propose Visual Information Purification for Evidence based Response (VIPER) a lightweight mitigation strategy that filters non evidentiary content for example social pressures and then generates constrained evidence first answers. This framework reduces sycophancy by an average amount outperforming baselines while maintaining interpretability. Our benchmark analysis and mitigation framework lay the groundwork for robust deployment of medical VLMs in real world clinician interactions emphasizing the need for evidence anchored defenses.
中文摘要:视觉语言模型在临床应用中常表现出迎合用户而非基于证据推理的倾向,本研究通过建立基准并采用VIPER轻量级缓解策略,有效减少模型盲从行为,同时保持可解释性。
English Summary: Vision language models in clinical settings often prioritize user agreement over evidence-based reasoning, but this study introduces a benchmark and the VIPER mitigation strategy to reduce sycophantic behavior while maintaining interpretability.

Authors:Gaole Dai, Shiqi Jiang, Ting Cao, Yuqing Yang, Yuanchun Li, Rui Tan, Mo Li, Lili Qiu
Title: ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration
Abstract:
Reward is critical to the evaluation and training of large language models (LLMs). However, existing rule-based or model-based reward methods struggle to generalize to GUI agents, where access to ground-truth trajectories or application databases is often unavailable, and static trajectory-based LLM-as-a-Judge approaches suffer from limited accuracy. To address these challenges, we propose ProRe, a proactive reward system that leverages a general-purpose reasoner and domain-specific evaluator agents (actors). The reasoner schedules targeted state probing tasks, which the evaluator agents then execute by actively interacting with the environment to collect additional observations. This enables the reasoner to assign more accurate and verifiable rewards to GUI agents. Empirical results on over 3K trajectories demonstrate that ProRe improves reward accuracy and F1 score by up to 5.3% and 19.4%, respectively. Furthermore, integrating ProRe with state-of-the-art policy agents yields a success rate improvement of up to 22.4%.
Chinese: ProRe是一种主动奖励系统,通过推理器调度探测任务和评估代理收集环境观察,显著提升了GUI代理的奖励准确性、F1分数和成功率。
English: ProRe is a proactive reward system that enhances GUI agent training by employing a reasoner to schedule probing tasks and evaluator agents to gather environmental observations, achieving significant improvements in reward accuracy, F1 score, and agent success rates.

Authors:Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Title: Bounds of Chain-of-Thought Robustness: Reasoning Steps, Embed Norms, and Beyond
Abstract:
Existing research indicates that the output of Chain-of-Thought (CoT) is significantly affected by input perturbations. Although many methods aim to mitigate such impact by optimizing prompts, a theoretical explanation of how these perturbations influence CoT outputs remains an open area of research. This gap limits our in-depth understanding of how input perturbations propagate during the reasoning process and hinders further improvements in prompt optimization methods. Therefore, in this paper, we theoretically analyze the effect of input perturbations on the fluctuation of CoT outputs. We first derive an upper bound for input perturbations under the condition that the output fluctuation is within an acceptable range, based on which we prove that: (i) This upper bound is positively correlated with the number of reasoning steps in the CoT; (ii) Even an infinitely long reasoning process cannot eliminate the impact of input perturbations. We then apply these conclusions to the Linear Self-Attention (LSA) model, which can be viewed as a simplified version of the Transformer. For the LSA model, we prove that the upper bound for input perturbation is negatively correlated with the norms of the input embedding and hidden state vectors. To validate this theoretical analysis, we conduct experiments on three mainstream datasets and four mainstream models. The experimental results align with our theoretical analysis, empirically demonstrating the correctness of our findings.
中文: 本文从理论上分析了输入扰动对思维链输出的影响,证明了扰动影响随推理步骤增加而加剧且无法消除,同时推导了扰动上限并通过实验验证了结论。
English: This paper theoretically analyzes how input perturbations affect Chain-of-Thought outputs, proving that their impact increases with reasoning steps and persists indefinitely, while also deriving perturbation bounds and validating findings through experiments.

Authors:Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen
Title: When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models
Abstract:
Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.
中文摘要:长上下文监督微调意外地通过改进注意力机制和前馈网络提升了大型语言模型在短上下文任务中的表现,但会产生知识偏好偏差,而混合训练方法能有效缓解这一问题。
English Summary: Long-context supervised fine-tuning unexpectedly enhances LLM performance on short-context tasks by improving both attention and feed-forward components, though it creates a knowledge bias that hybrid training can resolve.

Authors:Yingming Zheng, Hanqi Li, Kai Yu, Lu Chen
Title: When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models
Abstract:
Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.
中文摘要:长上下文监督微调意外地通过改进注意力机制和前馈网络提升了大型语言模型在短上下文任务中的表现,但会产生知识偏好偏差,而混合训练方法能有效缓解这一问题。
English Summary: Long-context supervised fine-tuning unexpectedly enhances LLM performance on short-context tasks by improving both attention and feed-forward components, though it creates a knowledge bias that hybrid training can resolve.

Authors:Tiancheng Huang, Ruisheng Cao, Yuxin Zhang, Zhangyi Kang, Zijian Wang, Chenrun Wang, Yijie Luo, Hang Zheng, Lirong Qian, Lu Chen, Kai Yu
Title: AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation
Abstract:
The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While large language models (LLMs) based agents are capable of automating question answering (QA) workflows for scientific papers, there still lacks a comprehensive and realistic benchmark to evaluate their capabilities. Moreover, training an interactive agent for this specific task is hindered by the shortage of high-quality interaction trajectories. In this work, we propose AirQA, a human-annotated comprehensive paper QA dataset in the field of artificial intelligence (AI), with 13,948 papers and 1,246 questions, that encompasses multi-task, multi-modal and instance-level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction data synthesis. With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. Evaluations of multiple open-source and proprietary models show that most models underperform on AirQA, demonstrating the quality of our dataset. Extensive experiments confirm that ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
中文: AirQA数据集通过提供人工标注、多任务的人工智能问答数据,解决了评估科学论文问答中基于大语言模型的智能体缺乏全面基准的问题,而ExTrActor框架则通过自动化指令合成,有效提升了小模型的多轮工具使用能力,使其性能可比肩大模型。
English: The AirQA dataset addresses the lack of a comprehensive benchmark for evaluating LLM-based agents in scientific paper question answering by providing human-annotated, multi-task AI data, while the ExTrActor framework automates instruction synthesis to enhance small models' tool-use capabilities to rival larger ones.

Authors:Wataru Hashimoto, Hidetaka Kamigaito, Taro Watanabe
Title: Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models
Abstract:
Decoding strategies manipulate the probability distribution underlying the output of a language model and can therefore affect both generation quality and its uncertainty. In this study, we investigate the impact of decoding strategies on uncertainty estimation in Large Language Models (LLMs). Our experiments show that Contrastive Search, which mitigates repetition, yields better uncertainty estimates on average across a range of preference-aligned LLMs. In contrast, the benefits of these strategies sometimes diverge when the model is only post-trained with supervised fine-tuning, i.e. without explicit alignment.
中文摘要:对比搜索通过减少重复,在已对齐的大型语言模型中普遍能提升不确定性估计效果,但在未经明确对齐的模型中其优势表现不一。
English Summary: Contrastive Search generally improves uncertainty estimation in aligned large language models by reducing repetition, but its effectiveness varies in models without explicit alignment.

Authors:Yangning Li, Tingwei Lu, Yinghui Li, Yankai Chen, Wei-Chieh Huang, Wenhao Jiang, Hui Wang, Hai-Tao Zheng, Philip S. Yu
Title: Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning
Abstract:
Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from the curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address the issue, Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework termed CAMPUS is proposed. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments prove the superior performance of CAMPUS, compared to other state-of-the-art baselines for efficient instruction tuning.
Chinese: CAMPUS框架提出了一种动态多视角课程学习方法,通过根据模型能力自适应调整训练数据难度,克服了静态方法的僵化问题,实现了卓越的指令调优性能。
English: The CAMPUS framework introduces a dynamic, multi-perspective curriculum learning approach for instruction tuning, which adaptively adjusts training data difficulty based on model competency to overcome the rigidity of static methods and achieve superior performance.

Authors:Shang Qin, Jingheng Ye, Yinghui Li, Hai-Tao Zheng, Qi Li, Jinxiao Shan, Zhixing Li, Hong-Gee Kim
Title: CL$^2$GEC: A Multi-Discipline Benchmark for Continual Learning in Chinese Literature Grammatical Error Correction
Abstract:
The growing demand for automated writing assistance in diverse academic domains highlights the need for robust Chinese Grammatical Error Correction (CGEC) systems that can adapt across disciplines. However, existing CGEC research largely lacks dedicated benchmarks for multi-disciplinary academic writing, overlooking continual learning (CL) as a promising solution to handle domain-specific linguistic variation and prevent catastrophic forgetting. To fill this crucial gap, we introduce CL$^2$GEC, the first Continual Learning benchmark for Chinese Literature Grammatical Error Correction, designed to evaluate adaptive CGEC across multiple academic fields. Our benchmark includes 10,000 human-annotated sentences spanning 10 disciplines, each exhibiting distinct linguistic styles and error patterns. CL$^2$GEC focuses on evaluating grammatical error correction in a continual learning setting, simulating sequential exposure to diverse academic disciplines to reflect real-world editorial dynamics. We evaluate large language models under sequential tuning, parameter-efficient adaptation, and four representative CL algorithms, using both standard GEC metrics and continual learning metrics adapted to task-level variation. Experimental results reveal that regularization-based methods mitigate forgetting more effectively than replay-based or naive sequential approaches. Our benchmark provides a rigorous foundation for future research in adaptive grammatical error correction across diverse academic domains.
中文:本研究推出了首个中文语法纠错持续学习基准CL$^2$GEC,通过包含10个学科的1万条标注语句评估跨领域适应能力,实验表明基于正则化的方法在连续学习环境中能最有效缓解遗忘问题。
English: The study introduces CL$^2$GEC, the first continual learning benchmark for Chinese grammatical error correction, designed to evaluate adaptive systems across 10 academic disciplines using 10,000 annotated sentences, revealing that regularization-based methods best mitigate forgetting in sequential learning scenarios.

Authors:Felix Wang, Boyu Chen, Kerun Xu, Bo Tang, Feiyu Xiong, Zhiyu Li
Title: Text2Mem: A Unified Memory Operation Language for Memory Operating System
Abstract:
Large language model agents increasingly depend on memory to sustain long horizon interaction, but existing frameworks remain limited. Most expose only a few basic primitives such as encode, retrieve, and delete, while higher order operations like merge, promote, demote, split, lock, and expire are missing or inconsistently supported. Moreover, there is no formal and executable specification for memory commands, leaving scope and lifecycle rules implicit and causing unpredictable behavior across systems. We introduce Text2Mem, a unified memory operation language that provides a standardized pathway from natural language to reliable execution. Text2Mem defines a compact yet expressive operation set aligned with encoding, storage, and retrieval. Each instruction is represented as a JSON based schema instance with required fields and semantic invariants, which a parser transforms into typed operation objects with normalized parameters. A validator ensures correctness before execution, while adapters map typed objects either to a SQL prototype backend or to real memory frameworks. Model based services such as embeddings or summarization are integrated when required. All results are returned through a unified execution contract. This design ensures safety, determinism, and portability across heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark that separates schema generation from backend execution to enable systematic evaluation. Together, these components establish the first standardized foundation for memory control in agents.
中文: 摘要介绍了Text2Mem这一统一内存操作语言,它将自然语言指令标准化为可靠执行,通过提供丰富的操作集和计划中的评估基准,解决了现有内存框架的局限性,确保跨系统安全性和可移植性。
English: The abstract introduces Text2Mem, a unified memory operation language that standardizes natural language commands into reliable executions with safety and portability across systems, addressing limitations in existing memory frameworks by providing expressive operations and a planned benchmark for evaluation.

Authors:Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
Title: Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
Abstract:
Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. The size of KV cache grows linearly as the length of the sequence extends, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method which incorporates a soft token list. This method only tunes the model's embedding layer at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens' attention map to the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation compared to existing eviction approaches. We validate our approach through experiments conducted on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on the LongBench and over 3 points on RULER. This proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
中文: 针对KV缓存淘汰方法过度关注局部信息的问题,Judge Q通过引入软标记列表并以低成本微调嵌入层来捕捉全局信息,在LongBench和RULER等基准测试中性能显著提升且训练开销极小。
English: To address the limitations of local information focus in KV cache eviction methods, Judge Q introduces a soft token list that captures global information through low-cost embedding layer tuning, improving performance on benchmarks like LongBench and RULER with minimal training overhead.

Authors:Zhizheng Wang, Yifan Yang, Qiao Jin, Zhiyong Lu
Title: Gene-R1: Reasoning with Data-Augmented Lightweight LLMs for Gene Set Analysis
Abstract:
The gene set analysis (GSA) is a foundational approach for uncovering the molecular functions associated with a group of genes. Recently, LLM-powered methods have emerged to annotate gene sets with biological functions together with coherent explanatory insights. However, existing studies primarily focus on proprietary models, which have been shown to outperform their open-source counterparts despite concerns over cost and data privacy. Furthermore, no research has investigated the application of advanced reasoning strategies to the GSA task. To address this gap, we introduce Gene-R1, a data-augmented learning framework that equips lightweight and open-source LLMs with step-by-step reasoning capabilities tailored to GSA. Experiments on 1,508 in-distribution gene sets demonstrate that Gene-R1 achieves substantial performance gains, matching commercial LLMs. On 106 out-of-distribution gene sets, Gene-R1 performs comparably to both commercial and large-scale LLMs, exhibiting robust generalizability across diverse gene sources.
中文: Gene-R1是一种创新的数据增强学习框架,它赋予轻量级开源大语言模型分步推理能力来执行基因集分析,在保持跨基因来源强泛化能力的同时,实现了与商业模型相媲美的性能表现。
English: Gene-R1 is a novel data-augmented learning framework that enables lightweight, open-source LLMs to perform gene set analysis with step-by-step reasoning, achieving performance comparable to commercial models while ensuring robust generalizability across diverse gene sources.

Authors:Shuocheng Li, Yihao Liu, Silin Du, Wenxuan Zeng, Zhe Xu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Dongmei Zhang
Title: Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search
Abstract:
Large language models (LLMs) have shown great promise in automating data science workflows, but existing models still struggle with multi-step reasoning and tool use, which limits their effectiveness on complex data analysis tasks. To address this, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task-solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance multi-step reasoning, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 77.82% and 86.38% of tasks on InfiAgent-DABench, respectively-matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks.
中文摘要:提出的Jupiter框架和NbQA数据集增强了大型语言模型在数据科学中的多步推理和工具使用能力,在复杂分析任务上达到或超越了GPT-4o的性能表现。
English Summary: The proposed Jupiter framework and NbQA dataset enhance large language models' multi-step reasoning and tool-use capabilities in data science, achieving performance comparable to or surpassing GPT-4o on complex analysis tasks.

Authors:Tiancheng Yang, Lin Zhang, Jiaye Lin, Guimin Hu, Di Wang, Lijie Hu
Title: D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLMs via Layer-to-head Attention Diagnostics
Abstract:
Multimodal Large Language Models (MLLMs) achieve strong performance on tasks like image captioning and visual question answering, but remain prone to hallucinations, where generated text conflicts with the visual input. Prior work links this partly to insufficient visual attention, but existing attention-based detectors and mitigation typically apply uniform adjustments across layers and heads, obscuring where errors originate. In this paper, we first show these methods fail to accurately localize problematic layers. Then, we introduce two diagnostics: Layer Image Attention Entropy (LIAE) which flags anomalous layers, and Image Attention Focus (IAF) which scores attention heads within those layers. Analysis shows that LIAE pinpoints faulty layers and IAF reliably ranks heads that warrant correction. Guided by these signals, we propose Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a task-agnostic, attention-guided method that dynamically localizes and corrects errors during inference with negligible overhead. Results show our D-LEAF delivers a 53% relative improvement on standard captioning benchmarks, and on VQA both accuracy and F1-score improve by approximately 4%, substantially suppressing hallucinations while preserving efficiency.
中文: 本文提出的D-LEAF方法通过分析层级和注意力头,能动态定位并修正多模态大语言模型中的幻觉错误,在图像描述和视觉问答任务上取得显著性能提升且保持高效。
English: This paper introduces D-LEAF, a dynamic method that precisely identifies and corrects hallucination errors in MLLMs by analyzing layer and head attention, achieving significant improvements in captioning and VQA tasks with minimal overhead.

Authors:Omri Sgan Cohen, Ehud Malul, Yair Meidan, Dudu Mimran, Yuval Elovici, Asaf Shabtai
Title: KubeGuard: LLM-Assisted Kubernetes Hardening via Configuration Files and Runtime Logs Analysis
Abstract:
The widespread adoption of Kubernetes (K8s) for orchestrating cloud-native applications has introduced significant security challenges, such as misconfigured resources and overly permissive configurations. Failing to address these issues can result in unauthorized access, privilege escalation, and lateral movement within clusters. Most existing K8s security solutions focus on detecting misconfigurations, typically through static analysis or anomaly detection. In contrast, this paper presents KubeGuard, a novel runtime log-driven recommender framework aimed at mitigating risks by addressing overly permissive configurations. KubeGuard is designed to harden K8s environments through two complementary tasks: Resource Creation and Resource Refinement. It leverages large language models (LLMs) to analyze manifests and runtime logs reflecting actual system behavior, using modular prompt-chaining workflows. This approach enables KubeGuard to create least-privilege configurations for new resources and refine existing manifests to reduce the attack surface. KubeGuard's output manifests are presented as recommendations that users (e.g., developers and operators) can review and adopt to enhance cluster security. Our evaluation demonstrates that KubeGuard effectively generates and refines K8s manifests for Roles, NetworkPolicies, and Deployments, leveraging both proprietary and open-source LLMs. The high precision, recall, and F1-scores affirm KubeGuard's practicality as a framework that translates runtime observability into actionable, least-privilege configuration guidance.
中文: 本文提出KubeGuard框架,通过大语言模型分析运行时日志生成最小权限的Kubernetes配置建议,能有效解决过度授权导致的安全风险并提升集群防护能力。
English: This paper introduces KubeGuard, a runtime log-driven framework that uses large language models to generate and refine least-privilege Kubernetes configurations, effectively mitigating security risks from overly permissive settings through actionable recommendations.

Authors:Ashmari Pramodya, Nirasha Nelki, Heshan Shalinda, Chamila Liyanage, Yusuke Sakai, Randil Pushpananda, Ruvan Weerasinghe, Hidetaka Kamigaito, Taro Watanabe
Title: SinhalaMMLU: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in Sinhala
Abstract:
Large Language Models (LLMs) demonstrate impressive general knowledge and reasoning abilities, yet their evaluation has predominantly focused on global or anglocentric subjects, often neglecting low-resource languages and culturally specific content. While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context. To address this, we introduce SinhalaMMLU, the first multiple-choice question answering benchmark designed specifically for Sinhala, a low-resource language. The dataset includes over 7,000 questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum, and covers six domains and 30 subjects, encompassing both general academic topics and culturally grounded knowledge. We evaluate 26 LLMs on SinhalaMMLU and observe that, while Claude 3.5 sonnet and GPT-4o achieve the highest average accuracies at 67% and 62% respectively, overall model performance remains limited. In particular, models struggle in culturally rich domains such as the Humanities, revealing substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.
Chinese: 本文推出了首个针对僧伽罗语的大语言模型评测基准SinhalaMMLU,发现尽管Claude 3.5 sonnet和GPT-4o表现最佳,但所有模型在低资源语言和文化特定知识处理上仍存在明显不足。
English: This paper introduces SinhalaMMLU, the first comprehensive benchmark for evaluating large language models on Sinhala, revealing their limitations in handling low-resource languages and culturally specific knowledge despite Claude 3.5 sonnet and GPT-4o achieving the highest scores.

Authors:Mohit Mendiratta, Mayur Deshmukh, Kartik Teotia, Vladislav Golyanik, Adam Kortylewski, Christian Theobalt
Title: GRMM: Real-Time High-Fidelity Gaussian Morphable Head Model with Learned Residuals
Abstract:
3D Morphable Models (3DMMs) enable controllable facial geometry and expression editing for reconstruction, animation, and AR/VR, but traditional PCA-based mesh models are limited in resolution, detail, and photorealism. Neural volumetric methods improve realism but remain too slow for interactive use. Recent Gaussian Splatting (3DGS) based facial models achieve fast, high-quality rendering but still depend solely on a mesh-based 3DMM prior for expression control, limiting their ability to capture fine-grained geometry, expressions, and full-head coverage. We introduce GRMM, the first full-head Gaussian 3D morphable model that augments a base 3DMM with residual geometry and appearance components, additive refinements that recover high-frequency details such as wrinkles, fine skin texture, and hairline variations. GRMM provides disentangled control through low-dimensional, interpretable parameters (e.g., identity shape, facial expressions) while separately modelling residuals that capture subject- and expression-specific detail beyond the base model's capacity. Coarse decoders produce vertex-level mesh deformations, fine decoders represent per-Gaussian appearance, and a lightweight CNN refines rasterised images for enhanced realism, all while maintaining 75 FPS real-time rendering. To learn consistent, high-fidelity residuals, we present EXPRESS-50, the first dataset with 60 aligned expressions across 50 identities, enabling robust disentanglement of identity and expression in Gaussian-based 3DMMs. Across monocular 3D face reconstruction, novel-view synthesis, and expression transfer, GRMM surpasses state-of-the-art methods in fidelity and expression accuracy while delivering interactive real-time performance.
中文: GRMM是首个全头高斯3D可形变模型,通过增强基础3DMM的残差组件来捕捉皱纹和毛发等高频细节,以75 FPS实现实时渲染,并在面部重建和表情任务中表现卓越。
English: GRMM is the first full-head Gaussian 3D morphable model that enhances a base 3DMM with residual components to capture high-frequency details like wrinkles and hair, providing real-time rendering at 75 FPS and superior performance in facial reconstruction and expression tasks.

Authors:Yotam Erel, Rishabh Dabral, Vladislav Golyanik, Amit H. Bermano, Christian Theobalt
Title: PractiLight: Practical Light Control Using Foundational Diffusion Models
Abstract:
Light control in generated images is a difficult task, posing specific challenges, spanning over the entire image and frequency spectrum. Most approaches tackle this problem by training on extensive yet domain-specific datasets, limiting the inherent generalization and applicability of the foundational backbones used. Instead, PractiLight is a practical approach, effectively leveraging foundational understanding of recent generative models for the task. Our key insight is that lighting relationships in an image are similar in nature to token interaction in self-attention layers, and hence are best represented there. Based on this and other analyses regarding the importance of early diffusion iterations, PractiLight trains a lightweight LoRA regressor to produce the direct irradiance map for a given image, using a small set of training images. We then employ this regressor to incorporate the desired lighting into the generation process of another image using Classifier Guidance. This careful design generalizes well to diverse conditions and image domains. We demonstrate state-of-the-art performance in terms of quality and control with proven parameter and data efficiency compared to leading works over a wide variety of scenes types. We hope this work affirms that image lighting can feasibly be controlled by tapping into foundational knowledge, enabling practical and general relighting.
Chinese Summary: PractiLight提出了一种实用的图像光照控制方法,通过利用生成模型的基础知识,采用轻量级LoRA回归器生成辐照度图,并应用分类器引导技术,实现在多种场景下高效且通用的重光照效果。
English Summary: PractiLight introduces a practical method for controlling image lighting by leveraging generative models' foundational knowledge, using a lightweight LoRA regressor to produce irradiance maps and applying Classifier Guidance for effective and generalized relighting across diverse scenes.

Authors:Songze Li, Zun Wang, Gengze Zhou, Jialu Li, Xiangyu Zeng, Limin Wang, Yu Qiao, Qi Wu, Mohit Bansal, Yi Wang
Title: Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale
Abstract:
Goal-oriented language-guided navigation requires robust exploration capabilities for agents to navigate to specified goals in unknown environments without step-by-step instructions. Existing methods tend to exclusively utilize shortest-path trajectories, lacking effective exploration priors for training navigation agents. To address the above challenges, we present SID, a goal-oriented language-guided navigation learning approach with Self-Improving Demonstrations. Specifically, SID learns an initial agent on the shortest-path data sampled from environments and then leverages this agent to generate novel exploration trajectories. The novel rollouts provide demonstrations with stronger exploration strategies to train a better agent, which in turn produces higher-quality agent demonstrations for the next round of training. We show that this iterative self-improving pipeline readily scales to new environments, and the resulting demonstrations can be transferred across a variety of language-guided navigation tasks, elevating the performance ceiling in diverse goal-oriented navigation tasks. Extensive experiments demonstrate that SID significantly boosts the exploration capabilities and generalization of navigation agents. The resulting agent achieves new state-of-the-art performance on goal-oriented language-guided navigation tasks, including REVERIE, SOON, notably achieving a 50.9% success rate on the unseen validation splits of SOON, surpassing the prior leading approaches by a margin of 13.9%.
中文摘要:SID提出了一种目标导向语言导航的自改进学习方法,通过迭代生成探索轨迹来增强智能体训练,在SOON未见验证集上以50.9%的成功率实现了最先进的性能表现。
English Summary: SID introduces a self-improving learning approach for goal-oriented language-guided navigation that iteratively generates novel exploration trajectories to enhance agent training, achieving state-of-the-art performance with a 50.9% success rate on SOON's unseen validation splits.

Authors:Yukai Zhao, Menghan Wu, Xing Hu, Xin Xia
Title: HFuzzer: Testing Large Language Models for Package Hallucinations via Phrase-based Fuzzing
Abstract:
Large Language Models (LLMs) are widely used for code generation, but they face critical security risks when applied to practical production due to package hallucinations, in which LLMs recommend non-existent packages. These hallucinations can be exploited in software supply chain attacks, where malicious attackers exploit them to register harmful packages. It is critical to test LLMs for package hallucinations to mitigate package hallucinations and defend against potential attacks. Although researchers have proposed testing frameworks for fact-conflicting hallucinations in natural language generation, there is a lack of research on package hallucinations. To fill this gap, we propose HFUZZER, a novel phrase-based fuzzing framework to test LLMs for package hallucinations. HFUZZER adopts fuzzing technology and guides the model to infer a wider range of reasonable information based on phrases, thereby generating enough and diverse coding tasks. Furthermore, HFUZZER extracts phrases from package information or coding tasks to ensure the relevance of phrases and code, thereby improving the relevance of generated tasks and code. We evaluate HFUZZER on multiple LLMs and find that it triggers package hallucinations across all selected models. Compared to the mutational fuzzing framework, HFUZZER identifies 2.60x more unique hallucinated packages and generates more diverse tasks. Additionally, when testing the model GPT-4o, HFUZZER finds 46 unique hallucinated packages. Further analysis reveals that for GPT-4o, LLMs exhibit package hallucinations not only during code generation but also when assisting with environment configuration.
中文: 针对大语言模型在代码生成中的包幻觉安全风险,我们提出了HFUZZER这一基于短语的模糊测试框架,能有效生成多样化编程任务,并比现有方法识别出更多独特的幻觉包。
English: To address the security risks of package hallucinations in LLMs for code generation, we propose HFUZZER, a phrase-based fuzzing framework that effectively generates diverse coding tasks and identifies significantly more unique hallucinated packages than existing methods.

Authors:Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, Yi Wang
Title: VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception
Abstract:
Inducing reasoning in multimodal large language models (MLLMs) is critical for achieving human-level perception and understanding. Existing methods mainly leverage LLM reasoning to analyze parsed visuals, often limited by static perception stages. This paper introduces Visual Test-Time Scaling (VTTS), a novel approach to enhance MLLMs' reasoning via iterative perception during inference. VTTS mimics humans' hierarchical attention by progressively refining focus on high-confidence spatio-temporal regions, guided by updated textual predictions. Specifically, VTTS employs an Iterative Perception (ITP) mechanism, incorporating reinforcement learning with spatio-temporal supervision to optimize reasoning. To support this paradigm, we also present VTTS-80K, a dataset tailored for iterative perception. These designs allows a MLLM to enhance its performance by increasing its perceptual compute. Extensive experiments validate VTTS's effectiveness and generalization across diverse tasks and benchmarks. Our newly introduced Videochat-R1.5 model has achieved remarkable improvements, with an average increase of over 5\%, compared to robust baselines such as Qwen2.5VL-3B and -7B, across more than 15 benchmarks that encompass video conversation, video reasoning, and spatio-temporal perception.
中文: 本文提出视觉测试时扩展(VTTS)方法,通过迭代感知和分层注意力机制增强多模态大语言模型的推理能力,在多项基准测试中实现显著性能提升。
English: This paper introduces Visual Test-Time Scaling (VTTS), a novel approach that enhances multimodal large language models' reasoning through iterative perception and hierarchical attention, validated by significant performance improvements across diverse benchmarks.

Authors:Zhenguo Sun, Yibo Peng, Yuan Meng, Xukun Li, Bo-Sheng Huang, Zhenshan Bing, Xinlong Wang, Alois Knoll
Title: RobotDancing: Residual-Action Reinforcement Learning Enables Robust Long-Horizon Humanoid Motion Tracking
Abstract:
Long-horizon, high-dynamic motion tracking on humanoids remains brittle because absolute joint commands cannot compensate model-plant mismatch, leading to error accumulation. We propose RobotDancing, a simple, scalable framework that predicts residual joint targets to explicitly correct dynamics discrepancies. The pipeline is end-to-end--training, sim-to-sim validation, and zero-shot sim-to-real--and uses a single-stage reinforcement learning (RL) setup with a unified observation, reward, and hyperparameter configuration. We evaluate primarily on Unitree G1 with retargeted LAFAN1 dance sequences and validate transfer on H1/H1-2. RobotDancing can track multi-minute, high-energy behaviors (jumps, spins, cartwheels) and deploys zero-shot to hardware with high motion tracking quality.
中文:RobotDancing提出了一种通过预测残差关节目标来修正人形机器人动态差异的框架,实现了零样本从仿真到实物的高动态舞蹈动作跟踪。
English: RobotDancing introduces a scalable framework using residual joint targets to correct dynamics mismatches in humanoid motion tracking, enabling zero-shot sim-to-real deployment for high-energy dance sequences.

Authors:Yuan Meng, Zhenguo Sun, Max Fest, Xukun Li, Zhenshan Bing, Alois Knoll
Title: Growing with Your Embodied Agent: A Human-in-the-Loop Lifelong Code Generation Framework for Long-Horizon Manipulation Skills
Abstract:
Large language models (LLMs)-based code generation for robotic manipulation has recently shown promise by directly translating human instructions into executable code, but existing methods remain noisy, constrained by fixed primitives and limited context windows, and struggle with long-horizon tasks. While closed-loop feedback has been explored, corrected knowledge is often stored in improper formats, restricting generalization and causing catastrophic forgetting, which highlights the need for learning reusable skills. Moreover, approaches that rely solely on LLM guidance frequently fail in extremely long-horizon scenarios due to LLMs' limited reasoning capability in the robotic domain, where such issues are often straightforward for humans to identify. To address these challenges, we propose a human-in-the-loop framework that encodes corrections into reusable skills, supported by external memory and Retrieval-Augmented Generation with a hint mechanism for dynamic reuse. Experiments on Ravens, Franka Kitchen, and MetaWorld, as well as real-world settings, show that our framework achieves a 0.93 success rate (up to 27% higher than baselines) and a 42% efficiency improvement in correction rounds. It can robustly solve extremely long-horizon tasks such as "build a house", which requires planning over 20 primitives.
中文: 该研究提出的人机协同框架通过外部存储和检索增强生成将纠错编码为可复用技能,显著提升了机器人代码生成在长周期任务中的成功率和效率。
English: The proposed human-in-the-loop framework enhances robotic code generation by encoding corrections into reusable skills with external memory and retrieval-augmented generation, achieving higher success rates and efficiency in long-horizon tasks.

Authors:Haiming Zhang, Yiyao Zhu, Wending Zhou, Xu Yan, Yingjie Cai, Bingbing Liu, Shuguang Cui, Zhen Li
Title: SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving
Abstract:
Sparse Perception Models (SPMs) adopt a query-driven paradigm that forgoes explicit dense BEV or volumetric construction, enabling highly efficient computation and accelerated inference. In this paper, we introduce SQS, a novel query-based splatting pre-training specifically designed to advance SPMs in autonomous driving. SQS introduces a plug-in module that predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features through the reconstruction of multi-view images and depth maps. During fine-tuning, the pre-trained Gaussian queries are seamlessly integrated into downstream networks via query interaction mechanisms that explicitly connect pre-trained queries with task-specific queries, effectively accommodating the diverse requirements of occupancy prediction and 3D object detection. Extensive experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks, notably in occupancy prediction and 3D object detection, outperforming prior state-of-the-art pre-training approaches by a significant margin (i.e., +1.3 mIoU on occupancy prediction and +1.0 NDS on 3D detection).
中文: SQS是一种专为稀疏感知模型设计的创新查询式溅射预训练方法,通过3D高斯表示和多视图重建学习细粒度特征,显著提升了自动驾驶中占据预测和3D目标检测任务的性能表现。
English: SQS is a novel query-based splatting pre-training method for Sparse Perception Models that enhances autonomous driving tasks by learning fine-grained features through 3D Gaussian representations and multi-view reconstruction, achieving significant performance gains in occupancy prediction and 3D object detection.

Authors:Bo-Wen Yin, Jiao-Long Cao, Xuying Zhang, Yuming Chen, Ming-Ming Cheng, Qibin Hou
Title: OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
Abstract:
Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt, which contains five popular visual modalities. 2) We provide an efficient pretraining manner to endow the model with the capacity to encode different modality information in the ImageNeXt. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of the arbitrary combination of the involved modalities. Remarkably, our OmniSegmentor achieves new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360.
中文:OmniSegmentor框架通过ImageNeXt数据集提出通用多模态预训练方法,利用多种视觉模态增强模型感知能力,在多个语义分割基准上实现了最先进的性能。
English: The OmniSegmentor framework introduces a universal multi-modal pretraining approach using the ImageNeXt dataset, achieving state-of-the-art results across multiple semantic segmentation benchmarks by enhancing model perception with diverse visual modalities.

Authors:Laura Ribeiro, Muhammad Shaheer, Miguel Fernandez-Cortizas, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez
Title: Human Interaction for Collaborative Semantic SLAM using Extended Reality
Abstract:
Semantic SLAM (Simultaneous Localization and Mapping) systems enrich robot maps with structural and semantic information, enabling robots to operate more effectively in complex environments. However, these systems struggle in real-world scenarios with occlusions, incomplete data, or ambiguous geometries, as they cannot fully leverage the higher-level spatial and semantic knowledge humans naturally apply. We introduce HICS-SLAM, a Human-in-the-Loop semantic SLAM framework that uses a shared extended reality environment for real-time collaboration. The system allows human operators to directly interact with and visualize the robot's 3D scene graph, and add high-level semantic concepts (e.g., rooms or structural entities) into the mapping process. We propose a graph-based semantic fusion methodology that integrates these human interventions with robot perception, enabling scalable collaboration for enhanced situational awareness. Experimental evaluations on real-world construction site datasets demonstrate improvements in room detection accuracy, map precision, and semantic completeness compared to automated baselines, demonstrating both the effectiveness of the approach and its potential for future extensions.
Chinese Summary: HICS-SLAM是一种人在回路的语义SLAM框架,通过扩展现实实现人机实时协作,在复杂环境中显著提升了机器人的建图精度和语义完整性。
English Summary: HICS-SLAM is a human-in-the-loop semantic SLAM framework that integrates real-time human interventions through extended reality to enhance robot mapping accuracy and semantic completeness in challenging environments.

Authors:Kerui Huang, Shuhan Liu, Xing Hu, Tongtong Xu, Lingfeng Bao, Xin Xia
Title: Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework
Abstract:
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by prompting intermediate steps, improving accuracy and robustness in arithmetic, logic, and commonsense tasks. However, this benefit comes with high computational costs: longer outputs increase latency, memory usage, and KV-cache demands. These issues are especially critical in software engineering tasks where concise and deterministic outputs are required. To investigate these trade-offs, we conduct an empirical study based on code generation benchmarks. The results reveal that longer CoT does not always help. Excessive reasoning often causes truncation, accuracy drops, and latency up to five times higher, with failed outputs consistently longer than successful ones. These findings challenge the assumption that longer reasoning is inherently better and highlight the need for adaptive CoT control. Motivated by this, we propose SEER (Self-Enhancing Efficient Reasoning), an adaptive framework that compresses CoT while preserving accuracy. SEER combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds based on pre-inference outputs to reduce verbosity and computational overhead. We then evaluate SEER on three software engineering tasks and one math task. On average, SEER shortens CoT by 42.1%, improves accuracy by reducing truncation, and eliminates most infinite loops. These results demonstrate SEER as a practical method to make CoT-enhanced LLMs more efficient and robust, even under resource constraints.
中文: 思维链推理虽提升大语言模型性能却带来高昂计算成本,为此提出的SEER自适应框架通过压缩推理步骤,在保持精度的同时平均缩短42.1%推理长度,有效降低延迟与资源消耗。
English: Chain-of-Thought reasoning improves LLM performance but incurs high computational costs, prompting the development of SEER, an adaptive framework that compresses reasoning steps to maintain accuracy while reducing latency and resource usage by 42.1% on average.

Authors:Asier Bikandi, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez
Title: BIM Informed Visual SLAM for Construction Monitoring
Abstract:
Simultaneous Localization and Mapping (SLAM) is a key tool for monitoring construction sites, where aligning the evolving as-built state with the as-planned design enables early error detection and reduces costly rework. LiDAR-based SLAM achieves high geometric precision, but its sensors are typically large and power-demanding, limiting their use on portable platforms. Visual SLAM offers a practical alternative with lightweight cameras already embedded in most mobile devices. however, visually mapping construction environments remains challenging: repetitive layouts, occlusions, and incomplete or low-texture structures often cause drift in the trajectory map. To mitigate this, we propose an RGB-D SLAM system that incorporates the Building Information Model (BIM) as structural prior knowledge. Instead of relying solely on visual cues, our system continuously establishes correspondences between detected wall and their BIM counterparts, which are then introduced as constraints in the back-end optimization. The proposed method operates in real time and has been validated on real construction sites, reducing trajectory error by an average of 23.71% and map RMSE by 7.14% compared to visual SLAM baselines. These results demonstrate that BIM constraints enable reliable alignment of the digital plan with the as-built scene, even under partially constructed conditions.
Chinese Summary: 该研究提出了一种结合建筑信息模型(BIM)作为结构先验的RGB-D SLAM系统,通过实时建立墙体与BIM的对应约束,在施工场景中将轨迹误差平均降低23.71%,地图均方根误差降低7.14%,有效解决了视觉SLAM在重复结构环境中的漂移问题。
English Summary: The study introduces an RGB-D SLAM system enhanced with Building Information Model (BIM) as structural prior to address visual SLAM challenges in construction, reducing trajectory error by 23.71% and map RMSE by 7.14% through real-time BIM correspondence constraints.

Authors:Asier Bikandi-Noya, Miguel Fernandez-Cortizas, Muhammad Shaheer, Ali Tourani, Holger Voos, Jose Luis Sanchez-Lopez
Title: BIM Informed Visual SLAM for Construction Monitoring
Abstract:
Simultaneous Localization and Mapping (SLAM) is a key tool for monitoring construction sites, where aligning the evolving as-built state with the as-planned design enables early error detection and reduces costly rework. LiDAR-based SLAM achieves high geometric precision, but its sensors are typically large and power-demanding, limiting their use on portable platforms. Visual SLAM offers a practical alternative with lightweight cameras already embedded in most mobile devices. however, visually mapping construction environments remains challenging: repetitive layouts, occlusions, and incomplete or low-texture structures often cause drift in the trajectory map. To mitigate this, we propose an RGB-D SLAM system that incorporates the Building Information Model (BIM) as structural prior knowledge. Instead of relying solely on visual cues, our system continuously establishes correspondences between detected wall and their BIM counterparts, which are then introduced as constraints in the back-end optimization. The proposed method operates in real time and has been validated on real construction sites, reducing trajectory error by an average of 23.71% and map RMSE by 7.14% compared to visual SLAM baselines. These results demonstrate that BIM constraints enable reliable alignment of the digital plan with the as-built scene, even under partially constructed conditions.
Chinese Summary: 该研究提出了一种结合建筑信息模型(BIM)作为结构先验的RGB-D SLAM系统,通过实时建立墙体与BIM的对应约束,在施工场景中将轨迹误差平均降低23.71%,地图均方根误差降低7.14%,有效解决了视觉SLAM在重复结构环境中的漂移问题。
English Summary: The study introduces an RGB-D SLAM system enhanced with Building Information Model (BIM) as structural prior to address visual SLAM challenges in construction, reducing trajectory error by 23.71% and map RMSE by 7.14% through real-time BIM correspondence constraints.

Authors:Junzhi She, Xunkai Li, Rong-Hua Li, Guoren Wang
Title: State Space Models over Directed Graphs
Abstract:
Directed graphs are ubiquitous across numerous domains, where the directionality of edges encodes critical causal dependencies. However, existing GNNs and graph Transformers tailored for directed graphs face two major challenges: (1) effectively capturing long-range causal dependencies derived from directed edges; (2) balancing accuracy and training efficiency when processing large-scale graph datasets. In recent years, state space models (SSMs) have achieved substantial progress in causal sequence tasks, and their variants designed for graphs have demonstrated state-of-the-art accuracy while maintaining high efficiency across various graph learning benchmarks. However, existing graph state space models are exclusively designed for undirected graphs, which limits their performance in directed graph learning. To this end, we propose an innovative approach DirEgo2Token which sequentializes directed graphs via k-hop ego graphs. This marks the first systematic extension of state space models to the field of directed graph learning. Building upon this, we develop DirGraphSSM, a novel directed graph neural network architecture that implements state space models on directed graphs via the message-passing mechanism. Experimental results demonstrate that DirGraphSSM achieves state-of-the-art performance on three representative directed graph learning tasks while attaining competitive performance on two additional tasks with 1.5$\times $ to 2$\times $ training speed improvements compared to existing state-of-the-art models.
Chinese: 基于状态空间模型的新型有向图神经网络DirGraphSSM,在多项任务中实现最优性能,同时训练效率较现有最佳模型提升1.5至2倍。
English: DirGraphSSM, a novel directed graph neural network based on state space models, achieves state-of-the-art performance on multiple tasks while significantly improving training efficiency by 1.5x to 2x compared to existing models.

Authors:MohammadHossien Alishahi, Ming Zeng, Paul Fortier, Omer Waqar, Muhammad Hanif, Dinh Thai Hoang, Diep N. Nguyen, Quoc-Viet Pham
Title: Efficient STAR-RIS Mode for Energy Minimization in WPT-FL Networks with NOMA
Abstract:
With the massive deployment of IoT devices in 6G networks, several critical challenges have emerged, such as large communication overhead, coverage limitations, and limited battery lifespan. FL, WPT, multi-antenna AP, and RIS can mitigate these challenges by reducing the need for large data transmissions, enabling sustainable energy harvesting, and optimizing the propagation environment. Compared to conventional RIS, STAR-RIS not only extends coverage from half-space to full-space but also improves energy saving through appropriate mode selection. Motivated by the need for sustainable, low-latency, and energy-efficient communication in large-scale IoT networks, this paper investigates the efficient STAR-RIS mode in the uplink and downlink phases of a WPT-FL multi-antenna AP network with non-orthogonal multiple access to minimize energy consumption, a joint optimization that remains largely unexplored in existing works on RIS or STAR-RIS. We formulate a non-convex energy minimization problem for different STAR-RIS modes, i.e., energy splitting (ES) and time switching (TS), in both uplink and downlink transmission phases, where STAR-RIS phase shift vectors, beamforming matrices, time and power for harvesting, uplink transmission, and downlink transmission, local processing time, and computation frequency for each user are jointly optimized. To tackle the non-convexity, the problem is decoupled into two subproblems: the first subproblem optimizes STAR-RIS phase shift vectors and beamforming matrices across all WPT-FL phases using block coordinate descent over either semi-definite programming or Rayleigh quotient problems, while the second one allocates time, power, and computation frequency via the one-dimensional search algorithms or the bisection algorithm.
中文摘要:本文研究在采用非正交多址接入的WPT-FL多天线接入点网络中,利用STAR-RIS技术优化上下行传输阶段的能耗最小化问题,解决了现有RIS或STAR-RIS研究中尚未深入探讨的联合优化难题。
English Summary: This paper explores the use of STAR-RIS technology in a WPT-FL multi-antenna AP network with non-orthogonal multiple access to minimize energy consumption in both uplink and downlink phases, addressing a joint optimization challenge not previously examined in RIS or STAR-RIS studies.

Authors:Xiangtong Yao, Yirui Zhou, Yuan Meng, Yanwen Liu, Liangyu Dong, Zitao Zhang, Zhenshan Bing, Kai Huang, Fuchun Sun, Alois Knoll
Title: Inference-stage Adaptation-projection Strategy Adapts Diffusion Policy to Cross-manipulators Scenarios
Abstract:
Diffusion policies are powerful visuomotor models for robotic manipulation, yet they often fail to generalize to manipulators or end-effectors unseen during training and struggle to accommodate new task requirements at inference time. Addressing this typically requires costly data recollection and policy retraining for each new hardware or task configuration. To overcome this, we introduce an adaptation-projection strategy that enables a diffusion policy to perform zero-shot adaptation to novel manipulators and dynamic task settings, entirely at inference time and without any retraining. Our method first trains a diffusion policy in SE(3) space using demonstrations from a base manipulator. During online deployment, it projects the policy's generated trajectories to satisfy the kinematic and task-specific constraints imposed by the new hardware and objectives. Moreover, this projection dynamically adapts to physical differences (e.g., tool-center-point offsets, jaw widths) and task requirements (e.g., obstacle heights), ensuring robust and successful execution. We validate our approach on real-world pick-and-place, pushing, and pouring tasks across multiple manipulators, including the Franka Panda and Kuka iiwa 14, equipped with a diverse array of end-effectors like flexible grippers, Robotiq 2F/3F grippers, and various 3D-printed designs. Our results demonstrate consistently high success rates in these cross-manipulator scenarios, proving the effectiveness and practicality of our adaptation-projection strategy. The code will be released after peer review.
中文: 提出的适应-投影策略通过将生成的轨迹投影以满足新硬件和目标的约束,使扩散策略能够在无需重新训练的情况下实现对新机械臂和动态任务要求的零样本适应。
English: The proposed adaptation-projection strategy enables diffusion policies to achieve zero-shot adaptation to novel manipulators and dynamic task requirements during inference without retraining, by projecting generated trajectories to meet new hardware and objective constraints.

Authors:Pedro Miguel Bastos Soares, Ali Tourani, Miguel Fernandez-Cortizas, Asier Bikandi Noya, Jose Luis Sanchez-Lopez, Holger Voos
Title: SMapper: A Multi-Modal Data Acquisition Platform for SLAM Benchmarking
Abstract:
Advancing research in fields like Simultaneous Localization and Mapping (SLAM) and autonomous navigation critically depends on reliable and reproducible multimodal datasets. While several influential datasets have driven progress in these domains, they often suffer from limitations in sensing modalities, environmental diversity, and the reproducibility of the underlying hardware setups. To address these challenges, this paper introduces SMapper, a novel open-hardware, multi-sensor platform designed explicitly for, though not limited to, SLAM research. The device integrates synchronized LiDAR, multi-camera, and inertial sensing, supported by a robust calibration and synchronization pipeline that ensures precise spatio-temporal alignment across modalities. Its open and replicable design allows researchers to extend its capabilities and reproduce experiments across both handheld and robot-mounted scenarios. To demonstrate its practicality, we additionally release SMapper-light, a publicly available SLAM dataset containing representative indoor and outdoor sequences. The dataset includes tightly synchronized multimodal data and ground-truth trajectories derived from offline LiDAR-based SLAM with sub-centimeter accuracy, alongside dense 3D reconstructions. Furthermore, the paper contains benchmarking results on state-of-the-art LiDAR and visual SLAM frameworks using the SMapper-light dataset. By combining open-hardware design, reproducible data collection, and comprehensive benchmarking, SMapper establishes a robust foundation for advancing SLAM algorithm development, evaluation, and reproducibility.
中文: 本文提出SMapper这一开源硬件多传感器平台,通过同步多模态感知和可复现设计解决现有SLAM数据集的局限性,并配套发布SMapper-light数据集及基准测试结果,为SLAM研究的推进提供支持。
English: This paper introduces SMapper, an open-hardware multi-sensor platform addressing limitations in existing SLAM datasets through synchronized multimodal sensing and reproducible design, complemented by the SMapper-light dataset and benchmarking results to advance SLAM research.

Authors:Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
Title: Two Facets of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models
Abstract:
Inspired by the success of LLMs, GFMs are designed to learn the optimal embedding functions from multi-domain text-attributed graphs for the downstream cross-task generalization capability. Among the diverse architectures, graph VQ-MAE stands out among the increasingly diverse landscape of GFM. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.
中文: 图基础模型旨在从多领域文本属性图中学习最优嵌入以实现跨任务泛化,但面临模型退化和表示坍缩等优化困境;提出的MoT框架通过增强信息容量和正则化有效解决了这些问题,在多种场景下实现了卓越性能。
English: GFMs are developed to learn optimal embeddings from multi-domain text-attributed graphs for cross-task generalization, but face challenges like model degradation and representation collapse that hinder optimization, which the proposed MoT framework addresses through enhanced information capacity and regularization to achieve superior performance across various scenarios.

Authors:Kun Zhai, Siheng Chen, Xingjun Ma, Yu-Gang Jiang
Title: FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models
Abstract:
Federated Prompt Tuning (FPT) is an efficient method for cross-client collaborative fine-tuning of large Vision-Language Models (VLMs). However, models tuned using FPT are vulnerable to adversarial attacks, leading to misclassification in downstream tasks. In this work, we introduce Federated Adversarial Prompt Tuning (\textbf{FedAPT}), a novel method designed to enhance the adversarial robustness of FPT. We identify a key issue in FedAPT under non-independent and identically distributed (non-IID) settings: a \textit{class information gap} between clients and the global model. Clients rely solely on limited local label information to generate adversarial samples for training, while the global model must defend against adversarial attacks from global labels. To address this issue, we propose a \textbf{class-aware prompt generator} that generates visual prompts from text prompts. This generator is guided by a \emph{Global Label Embedding} (serving as a ``beacon") which encodes cross-client label information to create more globally-aligned visual prompts. Additionally, we propose a \textbf{cross-layer generator sharing} strategy to enhance prompt coupling across different layers of the model, further boosting adversarial robustness. Extensive experiments on multiple image classification datasets demonstrate the superiority of FedAPT in improving adversarial robustness, outperforming existing methods by a large margin. FedAPT also exhibits exceptional generalization in cross-domain and cross-dataset scenarios, indicating its effectiveness in real-world applications.
中文: FedAPT是一种新颖的联邦学习方法,通过引入基于全局标签嵌入的类别感知提示生成器和跨层共享策略,显著提升了视觉语言模型的对抗鲁棒性,在多种分类任务中大幅优于现有方法。
English: FedAPT is a novel federated learning method that enhances adversarial robustness in vision-language models by introducing a class-aware prompt generator guided by global label embeddings and a cross-layer sharing strategy, significantly outperforming existing approaches in various classification tasks.

Authors:Meihao Liao, Yueyang Pan, Rong-Hua Li, Guoren Wang
Title: Efficient Exact Resistance Distance Computation on Small-Treewidth Graphs: a Labelling Approach
Abstract:
Resistance distance computation is a fundamental problem in graph analysis, yet existing random walk-based methods are limited to approximate solutions and suffer from poor efficiency on small-treewidth graphs (e.g., road networks). In contrast, shortest-path distance computation achieves remarkable efficiency on such graphs by leveraging cut properties and tree decompositions. Motivated by this disparity, we first analyze the cut property of resistance distance. While a direct generalization proves impractical due to costly matrix operations, we overcome this limitation by integrating tree decompositions, revealing that the resistance distance $r(s,t)$ depends only on labels along the paths from $s$ and $t$ to the root of the decomposition. This insight enables compact labelling structures. Based on this, we propose \treeindex, a novel index method that constructs a resistance distance labelling of size $O(n \cdot h_{\mathcal{G}})$ in $O(n \cdot h_{\mathcal{G}}^2 \cdot d_{\max})$ time, where $h_{\mathcal{G}}$ (tree height) and $d_{\max}$ (maximum degree) behave as small constants in many real-world small-treewidth graphs (e.g., road networks). Our labelling supports exact single-pair queries in $O(h_{\mathcal{G}})$ time and single-source queries in $O(n \cdot h_{\mathcal{G}})$ time. Extensive experiments show that TreeIndex substantially outperforms state-of-the-art approaches. For instance, on the full USA road network, it constructs a $405$ GB labelling in $7$ hours (single-threaded) and answers exact single-pair queries in $10^{-3}$ seconds and single-source queries in $190$ seconds--the first exact method scalable to such large graphs.
Chinese: 针对小树宽图上电阻距离计算效率低的问题,我们提出了TreeIndex方法,通过利用树分解实现精确高效的查询,大幅优于现有技术。
English: To address the inefficiency of existing methods in computing resistance distance on small-treewidth graphs, we introduce TreeIndex, a novel labeling-based approach that enables exact and efficient queries by leveraging tree decompositions.

Authors:Weiyu Huang, Yuezhou Hu, Jun Zhu, Jianfei Chen
Title: CAST: Continuous and Differentiable Semi-Structured Sparsity-Aware Training for Large Language Models
Abstract:
Sparsity-aware training is an effective approach for transforming large language models (LLMs) into hardware-friendly sparse patterns, thereby reducing latency and memory consumption during inference. In this paper, we propose Continuous Adaptive Sparse Trainer (CAST), a fully continuous and differentiable sparsity-aware training framework for semi-structured (or "N:M") sparse models. Unlike previous approaches that optimize sparsity patterns and weights separately, CAST enables seamless joint optimization during training, while progressively transforming the model into the desired sparsity format. Specifically, CAST introduces three key components: 1) AdamS, a sparsity-aware optimizer that leverages adaptive L1 decay to promote uniform sparsification across all parameters; 2) Weight Scaling, a module designed to mitigate the magnitude reduction caused by decay while preserving desired sparsity patterns; 3) Knowledge Distillation, which employs the dense model as a self-teacher to enhance training efficiency. We evaluate CAST under 2:4 sparsity patterns across multiple model families, ranging from 125M to 13B parameters. Our results demonstrate significant improvements over previous state-of-the-art methods in both perplexity and zero-shot accuracy with minimal training resources. Notably, on LLaMA2-7B, our 2:4 sparse model achieves a negligible perplexity increase of 0.09 and a 0.36% gain in zero-shot accuracy compared to the dense model using only 2% of the original pretraining tokens. Additionally, we establish an accurate and robust empirical scaling law to predict sparse model performance given adequate training resources. Finally, we demonstrate the practical applicability of our sparse models by evaluating them under quantization and fine-tuning scenarios.
中文: CAST作为一种完全连续的自适应稀疏训练框架,通过联合优化稀疏模式和权重,在极少量训练资源下实现了半结构化稀疏模型的卓越性能。
English: CAST is a fully continuous sparsity-aware training framework that enables joint optimization of sparsity patterns and weights, achieving superior performance in semi-structured sparse models with minimal training resources.

Authors:Yingqian Cui, Zhenwei Dai, Pengfei He, Bing He, Hui Liu, Xianfeng Tang, Jingying Zeng, Suhang Wang, Yue Xing, Jiliang Tang, Benoit Dumoulin
Title: Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search
Abstract:
Large Language Models (LLMs) have achieved significant advances in reasoning tasks. A key approach is tree-based search with verifiers, which expand candidate reasoning paths and use reward models to guide pruning and selection. Although effective in improving accuracy, these methods are not optimal in terms of efficiency: they perform simple decomposition on the reasoning process, but ignore the planning-execution nature of tasks such as math reasoning or code generation. This results in inefficient exploration of reasoning process. To address this, we propose a dual-phase test-time scaling framework that explicitly separates reasoning into planning and execution, and performs search over the two phases individually. Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately. We further introduce a dynamic budget allocation mechanism that adaptively redistributes sampling effort based on reward feedback, allowing early stopping on confident steps and reallocation of computation to more challenging parts of the reasoning process. Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing redundant computation.
Chinese: 提出的双阶段测试时扩展框架通过分离规划与执行阶段,采用独立奖励模型和动态预算分配机制,在减少冗余计算的同时持续提升推理准确性。
English: The proposed dual-phase test-time scaling framework enhances reasoning efficiency by separating planning and execution phases, using individual reward models and dynamic budget allocation to improve accuracy while reducing redundant computation.

Authors:Bingrui Li, Jiaxin Wen, Zhanpeng Zhou, Jun Zhu, Jianfei Chen
Title: Efficient Hyperparameter Tuning via Trajectory Invariance Principle
Abstract:
As hyperparameter tuning becomes increasingly costly at scale, efficient tuning methods are essential. Yet principles for guiding hyperparameter tuning remain limited. In this work, we seek to establish such principles by considering a broad range of hyperparameters, including batch size, learning rate, and weight decay. We identify a phenomenon we call trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm exhibit invariance--closely overlapping--with respect to a quantity that combines learning rate and weight decay. This phenomenon effectively reduces the original two-dimensional hyperparameter space to one dimension, yielding an efficient tuning rule: follow the salient direction revealed by trajectory invariance. Furthermore, we refine previous scaling laws and challenge several existing viewpoints. Overall, our work proposes new principles for efficient tuning and inspires future research on scaling laws.
Chinese: 本研究提出了轨迹不变性现象,通过将学习率和权重衰减的二维超参数空间简化为一维,提供了一种高效的调优规则,并改进了缩放定律,为未来研究提供了新指导原则。
English: This study introduces trajectory invariance, a phenomenon that simplifies hyperparameter tuning by reducing the two-dimensional space of learning rate and weight decay into one dimension, offering an efficient tuning rule and refining scaling laws to guide future research.

Authors:Hang Li, Kaiqi Yang, Yucheng Chu, Hui Liu, Jiliang Tang
Title: Exploring Solution Divergence and Its Effect on Large Language Model Problem Solving
Abstract:
Large language models (LLMs) have been widely used for problem-solving tasks. Most recent work improves their performance through supervised fine-tuning (SFT) with labeled data or reinforcement learning (RL) from task feedback. In this paper, we study a new perspective: the divergence in solutions generated by LLMs for a single problem. We show that higher solution divergence is positively related to better problem-solving abilities across various models. Based on this finding, we propose solution divergence as a novel metric that can support both SFT and RL strategies. We test this idea on three representative problem domains and find that using solution divergence consistently improves success rates. These results suggest that solution divergence is a simple but effective tool for advancing LLM training and evaluation.
The study reveals that greater solution divergence in large language models correlates with enhanced problem-solving abilities, proposing it as an effective metric to improve training and evaluation methods.
English Summary:

Authors:Kun Zhu, Lizi Liao, Yuxuan Gu, Lei Huang, Xiaocheng Feng, Bing Qin
Title: Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering
Abstract:
The rapid growth of scientific literature demands efficient methods to organize and synthesize research findings. Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models (LLMs), often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering. Our method leverages LLMs to identify key aspects of each paper (e.g., methodology, dataset, evaluation) and generates aspect-specific paper summaries, which are then encoded and clustered along each aspect to form a coherent hierarchy. In addition, we introduce a new evaluation benchmark of 156 expert-crafted taxonomies encompassing 11.6k papers, providing the first naturally annotated dataset for this task. Experimental results demonstrate that our method significantly outperforms prior approaches, achieving state-of-the-art performance in taxonomy coherence, granularity, and interpretability.
中文: 我们提出的情境感知分层框架结合大语言模型的多维度编码与动态聚类技术,在156个专家构建的分类体系基准测试中显著超越了现有方法,在分类连贯性、精细度和可解释性方面均达到最优性能。
English: Our novel context-aware hierarchical framework integrates LLM-guided multi-aspect encoding with dynamic clustering to significantly outperform existing methods in taxonomy coherence, granularity, and interpretability, as validated against a new benchmark of 156 expert-crafted taxonomies.

Authors:Xingkun Yin, Kaibin Huang, Dong In Kim, Hongyang Du
Title: Experience Scaling: Post-Deployment Evolution For Large Language Models
Abstract:
Scaling model size, training data, and compute power have driven advances in large language models (LLMs), but these approaches are reaching saturation as human-generated text is exhausted and further gains diminish. We propose experience scaling, a framework for continuous post-deployment evolution for LLMs through autonomous interaction with the environment and collaborative sharing of accumulated experience. The framework captures raw interactions, distills them into compact, reusable knowledge, and periodically refines stored content to preserve relevance and efficiency. We validate the framework in simulated real-world scenarios involving generalization to previously unseen but related tasks, repetitive queries, and over-saturated knowledge stores. Across all settings, experience scaling improves accuracy, sustains performance over time, and maintains gains when applied to novel situations. These results demonstrate that structured post-deployment learning can extend LLM capabilities beyond the limits of static human-generated data, offering a scalable path for continued intelligence progress.
中文: 经验扩展框架通过让大语言模型在部署后自主与环境交互并共享提炼的知识,使其持续进化,从而在静态训练数据限制之外的新任务中提高准确性并维持性能表现。
English: Experience scaling enables large language models to evolve autonomously post-deployment by interacting with their environment and sharing distilled knowledge, which enhances accuracy and sustains performance across novel tasks beyond static training data limitations.

Authors:Yanjie Fu, Dongjie Wang, Wangyang Ying, Xiangliang Zhang, Huan Liu, Jian Pei
Title: Autonomous Data Agents: A New Opportunity for Smart Data
Abstract:
As data continues to grow in scale and complexity, preparing, transforming, and analyzing it remains labor-intensive, repetitive, and difficult to scale. Since data contains knowledge and AI learns knowledge from it, the alignment between AI and data is essential. However, data is often not structured in ways that are optimal for AI utilization. Moreover, an important question arises: how much knowledge can we pack into data through intensive data operations? Autonomous data agents (DataAgents), which integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling, can autonomously interpret data task descriptions, decompose tasks into subtasks, reason over actions, ground actions into python code or tool calling, and execute operations. Unlike traditional data management and engineering tools, DataAgents dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale. This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents are capable of handling collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval. Through these capabilities, DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend. We then define the concept of DataAgents and discuss their architectural design, training strategies, as well as the new skills and capabilities they enable. Finally, we call for concerted efforts to advance action workflow optimization, establish open datasets and benchmark ecosystems, safeguard privacy, balance efficiency with scalability, and develop trustworthy DataAgent guardrails to prevent malicious actions.
中文: 自主数据代理(DataAgents)通过集成大语言模型推理与任务分解,实现了将复杂数据转化为可操作知识的范式转变,为规模化数据处理提供了动态自适应的解决方案。
English: Autonomous data agents (DataAgents) represent a paradigm shift by leveraging LLM reasoning to dynamically transform complex data into actionable knowledge through automated workflows, addressing scalability and efficiency challenges in data processing.

Authors:Yanjie Fu, Dongjie Wang, Wangyang Ying, Xinyuan Wang, Xiangliang Zhang, Huan Liu, Jian Pei
Title: Autonomous Data Agents: A New Opportunity for Smart Data
Abstract:
As data continues to grow in scale and complexity, preparing, transforming, and analyzing it remains labor-intensive, repetitive, and difficult to scale. Since data contains knowledge and AI learns knowledge from it, the alignment between AI and data is essential. However, data is often not structured in ways that are optimal for AI utilization. Moreover, an important question arises: how much knowledge can we pack into data through intensive data operations? Autonomous data agents (DataAgents), which integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling, can autonomously interpret data task descriptions, decompose tasks into subtasks, reason over actions, ground actions into python code or tool calling, and execute operations. Unlike traditional data management and engineering tools, DataAgents dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale. This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents are capable of handling collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval. Through these capabilities, DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend. We then define the concept of DataAgents and discuss their architectural design, training strategies, as well as the new skills and capabilities they enable. Finally, we call for concerted efforts to advance action workflow optimization, establish open datasets and benchmark ecosystems, safeguard privacy, balance efficiency with scalability, and develop trustworthy DataAgent guardrails to prevent malicious actions.
中文: 自主数据代理(DataAgents)通过集成大语言模型推理与任务分解,实现了将复杂数据转化为可操作知识的范式转变,为规模化数据处理提供了动态自适应的解决方案。
English: Autonomous data agents (DataAgents) represent a paradigm shift by leveraging LLM reasoning to dynamically transform complex data into actionable knowledge through automated workflows, addressing scalability and efficiency challenges in data processing.

Authors:Xiaoting Yin, Hao Shi, Kailun Yang, Jiajun Zhai, Shangwei Guo, Lin Wang, Kaiwei Wang
Title: Event-guided 3D Gaussian Splatting for Dynamic Human and Scene Reconstruction
Abstract:
Reconstructing dynamic humans together with static scenes from monocular videos remains difficult, especially under fast motion, where RGB frames suffer from motion blur. Event cameras exhibit distinct advantages, e.g., microsecond temporal resolution, making them a superior sensing choice for dynamic human reconstruction. Accordingly, we present a novel event-guided human-scene reconstruction framework that jointly models human and scene from a single monocular event camera via 3D Gaussian Splatting. Specifically, a unified set of 3D Gaussians carries a learnable semantic attribute; only Gaussians classified as human undergo deformation for animation, while scene Gaussians stay static. To combat blur, we propose an event-guided loss that matches simulated brightness changes between consecutive renderings with the event stream, improving local fidelity in fast-moving regions. Our approach removes the need for external human masks and simplifies managing separate Gaussian sets. On two benchmark datasets, ZJU-MoCap-Blur and MMHPSD-Blur, it delivers state-of-the-art human-scene reconstruction, with notable gains over strong baselines in PSNR/SSIM and reduced LPIPS, especially for high-speed subjects.
中文摘要:本文提出了一种新颖的事件引导框架,通过3D高斯泼溅技术从单目事件相机视频中联合重建动态人体与静态场景,在快速运动场景下实现了最先进的性能并显著提升了视觉保真度。
English Summary: This paper introduces a novel event-guided framework that jointly reconstructs dynamic humans and static scenes from monocular event camera videos using 3D Gaussian Splatting, achieving state-of-the-art performance with improved visual fidelity in fast-motion scenarios.

Authors:Massa Baali, Sarthak Bisht, Francisco Teixeira, Kateryna Shapovalenko, Rita Singh, Bhiksha Raj
Title: SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions
Abstract:
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions causing signal degradations or mismatches between enrollment and test data, impacting performance. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive Speaker Verification tasks benchmark suite, assessing SV systems under stressors like recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several benchmarks do exist that each cover some of these issues, SVeritas is the first comprehensive evaluation that not only includes all of these, but also several other entirely new, but nonetheless important, real-life conditions that have not previously been benchmarked. We use SVeritas to evaluate several state-of-the-art SV models and observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending our analysis across demographic subgroups, we further identify disparities in robustness across age groups, gender, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.
中文摘要:SVeritas是首个全面评估说话人验证系统在多样化现实压力下性能的基准套件,揭示了跨语言测试、年龄差异和编解码压缩导致的性能下降,同时发现了不同人口统计群体间的鲁棒性差异。
English Summary: SVeritas is the first comprehensive benchmark suite that evaluates speaker verification systems under diverse real-world stressors, revealing performance degradation in cross-language trials, age mismatches, and codec compression while identifying demographic disparities in robustness.

Authors:Massa Baali, Sarthak Bisht, Francisco Teixeira, Kateryna Shapovalenko, Rita Singh, Bhiksha Raj
Title: SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions
Abstract:
Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions causing signal degradations or mismatches between enrollment and test data, impacting performance. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive Speaker Verification tasks benchmark suite, assessing SV systems under stressors like recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several benchmarks do exist that each cover some of these issues, SVeritas is the first comprehensive evaluation that not only includes all of these, but also several other entirely new, but nonetheless important, real-life conditions that have not previously been benchmarked. We use SVeritas to evaluate several state-of-the-art SV models and observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending our analysis across demographic subgroups, we further identify disparities in robustness across age groups, gender, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.
中文摘要:SVeritas是首个全面评估说话人验证系统在多样化现实压力下性能的基准套件,揭示了跨语言测试、年龄差异和编解码压缩导致的性能下降,同时发现了不同人口统计群体间的鲁棒性差异。
English Summary: SVeritas is the first comprehensive benchmark suite that evaluates speaker verification systems under diverse real-world stressors, revealing performance degradation in cross-language trials, age mismatches, and codec compression while identifying demographic disparities in robustness.

Authors:Qingyu Liu, Yushen Chen, Zhikang Niu, Chunhui Wang, Yunting Yang, Bowen Zhang, Jian Zhao, Pengcheng Zhu, Kai Yu, Xie Chen
Title: Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis
Abstract:
Flow-matching-based text-to-speech (TTS) models have shown high-quality speech synthesis. However, most current flow-matching-based TTS models still rely on reference transcripts corresponding to the audio prompt for synthesis. This dependency prevents cross-lingual voice cloning when audio prompt transcripts are unavailable, particularly for unseen languages. The key challenges for flow-matching-based TTS models to remove audio prompt transcripts are identifying word boundaries during training and determining appropriate duration during inference. In this paper, we introduce Cross-Lingual F5-TTS, a framework that enables cross-lingual voice cloning without audio prompt transcripts. Our method preprocesses audio prompts by forced alignment to obtain word boundaries, enabling direct synthesis from audio prompts while excluding transcripts during training. To address the duration modeling challenge, we train speaking rate predictors at different linguistic granularities to derive duration from speaker pace. Experiments show that our approach matches the performance of F5-TTS while enabling cross-lingual voice cloning.
中文总结:本文提出的跨语言F5-TTS框架通过强制对齐获取词边界,并利用多粒度语速预测器进行时长建模,实现了无需音频提示文本的跨语言语音克隆。
English Summary: This paper introduces Cross-Lingual F5-TTS, a framework that enables cross-lingual voice cloning without requiring audio prompt transcripts by using forced alignment for word boundaries and speaking rate predictors for duration modeling.

Authors:Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Minlie Huang
Title: When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity
Abstract:
Emojis are globally used non-verbal cues in digital communication, and extensive research has examined how large language models (LLMs) understand and utilize emojis across contexts. While usually associated with friendliness or playfulness, it is observed that emojis may trigger toxic content generation in LLMs. Motivated by such a observation, we aim to investigate: (1) whether emojis can clearly enhance the toxicity generation in LLMs and (2) how to interpret this phenomenon. We begin with a comprehensive exploration of emoji-triggered LLM toxicity generation by automating the construction of prompts with emojis to subtly express toxic intent. Experiments across 5 mainstream languages on 7 famous LLMs along with jailbreak tasks demonstrate that prompts with emojis could easily induce toxicity generation. To understand this phenomenon, we conduct model-level interpretations spanning semantic cognition, sequence generation and tokenization, suggesting that emojis can act as a heterogeneous semantic channel to bypass the safety mechanisms. To pursue deeper insights, we further probe the pre-training corpus and uncover potential correlation between the emoji-related data polution with the toxicity generation behaviors. Supplementary materials provide our implementation code and data. (Warning: This paper contains potentially sensitive contents)
中文: 研究表明,表情符号可能通过绕过安全机制的异质语义通道,轻易诱导大型语言模型生成有害内容,这一现象在跨语言和模型的实验中得到了验证。
English: Emojis can inadvertently trigger toxic content generation in large language models by acting as a semantic channel that bypasses safety mechanisms, as demonstrated through experiments across multiple languages and models.

Authors:Xiao Li, Qi Chen, Xiulian Peng, Kai Yu, Xie Chen, Yan Lu
Title: Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video
Abstract:
We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with less assumptions and inductive biases than previous works: it utilizes a transformer-based architecture to jointly generate flexible implicit features for frame-wise motion and clip-wise content, and incorporates a low-bitrate vector quantization as an information bottleneck to promote disentanglement and form a meaningful discrete motion space. The bitrate-controlled latent motion and content are used as conditional inputs to a denoising diffusion model to facilitate self-supervised representation learning. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks. Furthermore, we also show that our method can generalize to other types of video data, such as pixel sprites of 2D cartoon characters. Our work presents a new perspective on self-supervised learning of disentangled video representations, contributing to the broader field of video analysis and generation.
中文: 本研究提出了一种自监督框架,通过基于Transformer的架构和低比特率向量量化将视频数据解耦为动态运动和静态内容,并在多种视频类型上通过运动迁移和生成任务验证了其有效性。
English: This study introduces a self-supervised framework that disentangles video data into dynamic motion and static content using a transformer-based architecture and low-bitrate vector quantization, validated through motion transfer and generation tasks on diverse video types.

Authors:Wei Chu, Yuanzhe Dong, Ke Tan, Dong Han, Xavier Menendez-Pidal, Ruchao Fan, Chenfeng Miao, Chanwoo Kim, Bhiksha Raj, Rita Singh
Title: OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics
Abstract:
OleSpeech-IV dataset is a large-scale multispeaker and multilingual conversational speech dataset with diverse topics. The audio content comes from publicly-available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The IV denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.
中文:OleSpeech-IV数据集是一个大规模、多语种的对话语音集合,源自公开音频内容,包含人工精校的转录和说话人信息,并开放了非商业研究用的子集。
English: The OleSpeech-IV dataset is a large-scale, multilingual conversational speech collection from diverse public sources, featuring human-refined transcripts and speaker details, with a subset available for non-commercial research.

Authors:Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu
Title: AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions
Abstract:
Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from instruction sensitivity, where different instructions of the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.
Chinese: 当前大型音频语言模型常受指令敏感性困扰,而提出的AHAMask方法通过选择性屏蔽注意力头来无需指令即可激活听觉功能,在各类任务中实现相当甚至更优的性能表现。
English: Current large audio language models often suffer from instruction sensitivity, but the proposed AHAMask method addresses this by selectively masking attention heads to trigger acoustic functionalities without instructions, achieving comparable or better performance across tasks.

Authors:Peiran Liu, Qiang Zhang, Daojie Peng, Lingfeng Zhang, Yihao Qin, Hang Zhou, Jun Ma, Renjing Xu, Yiding Ji
Title: TopoNav: Topological Graphs as a Key Enabler for Advanced Object Navigation
Abstract:
Object Navigation (ObjectNav) has made great progress with large language models (LLMs), but still faces challenges in memory management, especially in long-horizon tasks and dynamic scenes. To address this, we propose TopoNav, a new framework that leverages topological structures as spatial memory. By building and updating a topological graph that captures scene connections, adjacency, and semantic meaning, TopoNav helps agents accumulate spatial knowledge over time, retrieve key information, and reason effectively toward distant goals. Our experiments show that TopoNav achieves state-of-the-art performance on benchmark ObjectNav datasets, with higher success rates and more efficient paths. It particularly excels in diverse and complex environments, as it connects temporary visual inputs with lasting spatial understanding.
Chinese: TopoNav提出了一种拓扑框架,通过构建和更新场景图来增强物体导航中的空间记忆,使智能体在复杂环境中实现更优的推理并达到最先进的性能。
English: TopoNav introduces a topological framework to enhance spatial memory in Object Navigation, enabling agents to build and update scene graphs for improved reasoning and achieving state-of-the-art performance in complex environments.

Authors:Wenxiao Wu, Jing-Hao Xue, Chengming Xu, Chen Liu, Xinwei Sun, Changxin Gao, Nong Sang, Yanwei Fu
Title: Towards Reliable and Holistic Visual In-Context Learning Prompt Selection
Abstract:
Visual In-Context Learning (VICL) has emerged as a prominent approach for adapting visual foundation models to novel tasks, by effectively exploiting contextual information embedded in in-context examples, which can be formulated as a global ranking problem of potential candidates. Current VICL methods, such as Partial2Global and VPR, are grounded in the similarity-priority assumption that images more visually similar to a query image serve as better in-context examples. This foundational assumption, while intuitive, lacks sufficient justification for its efficacy in selecting optimal in-context examples. Furthermore, Partial2Global constructs its global ranking from a series of randomly sampled pairwise preference predictions. Such a reliance on random sampling can lead to incomplete coverage and redundant samplings of comparisons, thus further adversely impacting the final global ranking. To address these issues, this paper introduces an enhanced variant of Partial2Global designed for reliable and holistic selection of in-context examples in VICL. Our proposed method, dubbed RH-Partial2Global, leverages a jackknife conformal prediction-guided strategy to construct reliable alternative sets and a covering design-based sampling approach to ensure comprehensive and uniform coverage of pairwise preferences. Extensive experiments demonstrate that RH-Partial2Global achieves excellent performance and outperforms Partial2Global across diverse visual tasks.
中文摘要:本文提出RH-Partial2Global这一改进的视觉上下文学习方法,通过刀切法共形预测构建可靠备选集,并采用覆盖设计采样确保成对偏好的全面覆盖,在多种视觉任务中展现出优于现有方法的性能。
English Summary: This paper introduces RH-Partial2Global, an enhanced VICL method that employs jackknife conformal prediction and covering design-based sampling to improve in-context example selection by ensuring reliable alternative sets and comprehensive pairwise preference coverage, demonstrating superior performance over existing methods.

Authors:Jingzhe Liu, Liam Collins, Jiliang Tang, Tong Zhao, Neil Shah, Clark Mingxuan Ju
Title: Understanding Generative Recommendation with Semantic IDs from a Model-scaling View
Abstract:
Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generative Recommendation (GR), which tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs), which are discrete codes quantized from the embeddings of modality encoders (e.g., large language or vision models), to represent items in an autoregressive user interaction sequence modeling setup (henceforth, SID-based GR). While generative models in other domains exhibit well-established scaling laws, our work reveals that SID-based GR shows significant bottlenecks while scaling up the model. In particular, the performance of SID-based GR quickly saturates as we enlarge each component: the modality encoder, the quantization tokenizer, and the RS itself. In this work, we identify the limited capacity of SIDs to encode item semantic information as one of the fundamental bottlenecks. Motivated by this observation, as an initial effort to obtain GR models with better scaling behaviors, we revisit another GR paradigm that directly uses large language models (LLMs) as recommenders (henceforth, LLM-as-RS). Our experiments show that the LLM-as-RS paradigm has superior model scaling properties and achieves up to 20 percent improvement over the best achievable performance of SID-based GR through scaling. We also challenge the prevailing belief that LLMs struggle to capture collaborative filtering information, showing that their ability to model user-item interactions improves as LLMs scale up. Our analyses on both SID-based GR and LLMs across model sizes from 44M to 14B parameters underscore the intrinsic scaling limits of SID-based GR and position LLM-as-RS as a promising path toward foundation models for GR.
中文摘要:基于语义ID的生成式推荐因语义编码能力有限而面临扩展瓶颈,而直接使用大语言模型作为推荐器则展现出更优的扩展性,性能提升最高达20%。
English Summary: Generative recommendation using semantic IDs faces scaling bottlenecks due to limited semantic encoding capacity, while directly employing large language models as recommenders demonstrates superior scaling and up to 20% performance improvement.

Authors:Efthymios Tsaprazlis, Tiantian Feng, Anil Ramakrishna, Rahul Gupta, Shrikanth Narayanan
Title: Assessing Visual Privacy Risks in Multimodal AI: A Novel Taxonomy-Grounded Evaluation of Vision-Language Models
Abstract:
Artificial Intelligence have profoundly transformed the technological landscape in recent years. Large Language Models (LLMs) have demonstrated impressive abilities in reasoning, text comprehension, contextual pattern recognition, and integrating language with visual understanding. While these advances offer significant benefits, they also reveal critical limitations in the models' ability to grasp the notion of privacy. There is hence substantial interest in determining if and how these models can understand and enforce privacy principles, particularly given the lack of supporting resources to test such a task. In this work, we address these challenges by examining how legal frameworks can inform the capabilities of these emerging technologies. To this end, we introduce a comprehensive, multi-level Visual Privacy Taxonomy that captures a wide range of privacy issues, designed to be scalable and adaptable to existing and future research needs. Furthermore, we evaluate the capabilities of several state-of-the-art Vision-Language Models (VLMs), revealing significant inconsistencies in their understanding of contextual privacy. Our work contributes both a foundational taxonomy for future research and a critical benchmark of current model limitations, demonstrating the urgent need for more robust, privacy-aware AI systems.
中文: 本研究提出可扩展的视觉隐私分类法来评估AI模型对隐私的理解,在揭示现有视觉语言模型存在显著语境认知缺陷的同时,为开发具备隐私意识的人工智能系统提供了基础性研究框架。
English: This study introduces a scalable Visual Privacy Taxonomy to assess AI models' understanding of privacy, revealing significant gaps in current Vision-Language Models' contextual awareness while providing foundational tools for future privacy-aware AI development.

Authors:Liyang Chen, Tianze Zhou, Xu He, Boshi Tang, Zhiyong Wu, Yang Huang, Yang Wu, Zhongqian Sun, Wei Yang, Helen Meng
Title: StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing
Abstract:
The visual dubbing task aims to generate mouth movements synchronized with the driving audio, which has seen significant progress in recent years. However, two critical deficiencies hinder their wide application: (1) Audio-only driving paradigms inadequately capture speaker-specific lip habits, which fail to generate lip movements similar to the target avatar; (2) Conventional blind-inpainting approaches frequently produce visual artifacts when handling obstructions (e.g., microphones, hands), limiting practical deployment. In this paper, we propose StableDub, a novel and concise framework integrating lip-habit-aware modeling with occlusion-robust synthesis. Specifically, building upon the Stable-Diffusion backbone, we develop a lip-habit-modulated mechanism that jointly models phonemic audio-visual synchronization and speaker-specific orofacial dynamics. To achieve plausible lip geometries and object appearances under occlusion, we introduce the occlusion-aware training strategy by explicitly exposing the occlusion objects to the inpainting process. By incorporating the proposed designs, the model eliminates the necessity for cost-intensive priors in previous methods, thereby exhibiting superior training efficiency on the computationally intensive diffusion-based backbone. To further optimize training efficiency from the perspective of model architecture, we introduce a hybrid Mamba-Transformer architecture, which demonstrates the enhanced applicability in low-resource research scenarios. Extensive experimental results demonstrate that StableDub achieves superior performance in lip habit resemblance and occlusion robustness. Our method also surpasses other methods in audio-lip sync, video quality, and resolution consistency. We expand the applicability of visual dubbing methods from comprehensive aspects, and demo videos can be found at https://stabledub.github.io.
中文摘要:StableDub是一种创新的视觉配音框架,通过结合唇部习惯建模与遮挡鲁棒合成技术,能够生成同步的嘴部运动,同时解决了说话者特定唇部习惯捕捉不足和遮挡物产生视觉伪影的关键问题。
English Summary: StableDub is a novel visual dubbing framework that integrates lip-habit modeling with occlusion-robust synthesis to generate synchronized mouth movements while overcoming speaker-specific lip habit capture and visual artifact issues.

Authors:Jihwan Lee, Sean Foley, Thanathai Lertpetchpun, Kevin Huang, Yoonjeong Lee, Tiantian Feng, Louis Goldstein, Dani Byrd, Shrikanth Narayanan
Title: ARTI-6: Towards Six-dimensional Articulatory Speech Encoding
Abstract:
We propose ARTI-6, a compact six-dimensional articulatory speech encoding framework derived from real-time MRI data that captures crucial vocal tract regions including the velum, tongue root, and larynx. ARTI-6 consists of three components: (1) a six-dimensional articulatory feature set representing key regions of the vocal tract; (2) an articulatory inversion model, which predicts articulatory features from speech acoustics leveraging speech foundation models, achieving a prediction correlation of 0.87; and (3) an articulatory synthesis model, which reconstructs intelligible speech directly from articulatory features, showing that even a low-dimensional representation can generate natural-sounding speech. Together, ARTI-6 provides an interpretable, computationally efficient, and physiologically grounded framework for advancing articulatory inversion, synthesis, and broader speech technology applications. The source code and speech samples are publicly available.
中文: ARTI-6 是一种基于实时 MRI 数据的紧凑六维发音语音框架,包含从声学预测发音特征的相关性达 0.87 的逆推模型和重建清晰语音的合成模型,为语音技术提供了高效且可解释的工具。
English: ARTI-6 is a compact six-dimensional articulatory speech framework derived from real-time MRI data, featuring an inversion model that predicts articulatory features from acoustics with 0.87 correlation and a synthesis model that reconstructs intelligible speech, offering an efficient and interpretable tool for speech technology.

Authors:Feng-Qi Cui, Jinyang Huang, Anyang Tong, Ziyu Jia, Jie Zhang, Zhi Liu, Dan Guo, Jianwei Lu, Meng Wang
Title: Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization
Abstract:
Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.
中文摘要:本研究提出的个人独立性通用微动作识别框架通过特征层的时间-频率对齐模块和损失层的组不变正则化方法,有效解决了不同个体动作表现差异问题,在MA-52数据集上实现了优于现有方法的准确性和鲁棒性。
English Summary: The proposed Person Independence Universal Micro-action Recognition Framework addresses inter-person variability in micro-action recognition through feature-level temporal-frequency alignment and loss-level group-invariant regularization, demonstrating superior accuracy and robustness on the MA-52 dataset.

Authors:Feng-Qi Cui, Jinyang Huang, Anyang Tong, Ziyu Jia, Jie Zhang, Zhi Liu, Dan Guo, Jianwei Lu, Meng Wang
Title: Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization
Abstract:
Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.
中文摘要:本研究提出的个人独立性通用微动作识别框架通过特征层的时间-频率对齐模块和损失层的组不变正则化方法,有效解决了不同个体动作表现差异问题,在MA-52数据集上实现了优于现有方法的准确性和鲁棒性。
English Summary: The proposed Person Independence Universal Micro-action Recognition Framework addresses inter-person variability in micro-action recognition through feature-level temporal-frequency alignment and loss-level group-invariant regularization, demonstrating superior accuracy and robustness on the MA-52 dataset.

Authors:Kartik Teotia, Helge Rhodin, Mohit Mendiratta, Hyeongwoo Kim, Marc Habermann, Christian Theobalt
Title: Audio-Driven Universal Gaussian Head Avatars
Abstract:
We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture the identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes, both, geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject's global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth interior appearance as well as motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that can account for detailed appearance modeling and rendering, but it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.
Chinese: 本文提出首个通用逼真虚拟形象合成方法,通过将原始音频输入映射到同时编码几何与外观变化的潜在表情空间,生成具有精确口型同步和细腻表情细节的高度逼真虚拟形象。
English: This paper presents the first universal photorealistic avatar synthesis method that maps raw audio inputs into a latent expression space encoding both geometric and appearance variations, enabling highly realistic avatars with precise lip synchronization and nuanced expressive details.

Authors:Efthymios Tsaprazlis, Thanathai Lertpetchpun, Tiantian Feng, Sai Praneeth Karimireddy, Shrikanth Narayanan
Title: VoxGuard: Evaluating User and Attribute Privacy in Speech via Membership Inference Attacks
Abstract:
Voice anonymization aims to conceal speaker identity and attributes while preserving intelligibility, but current evaluations rely almost exclusively on Equal Error Rate (EER) that obscures whether adversaries can mount high-precision attacks. We argue that privacy should instead be evaluated in the low false-positive rate (FPR) regime, where even a small number of successful identifications constitutes a meaningful breach. To this end, we introduce VoxGuard, a framework grounded in differential privacy and membership inference that formalizes two complementary notions: User Privacy, preventing speaker re-identification, and Attribute Privacy, protecting sensitive traits such as gender and accent. Across synthetic and real datasets, we find that informed adversaries, especially those using fine-tuned models and max-similarity scoring, achieve orders-of-magnitude stronger attacks at low-FPR despite similar EER. For attributes, we show that simple transparent attacks recover gender and accent with near-perfect accuracy even after anonymization. Our results demonstrate that EER substantially underestimates leakage, highlighting the need for low-FPR evaluation, and recommend VoxGuard as a benchmark for evaluating privacy leakage.
中文: 语音匿名化应基于低误报率而非等误率进行评估,VoxGuard框架证明在此机制下隐私攻击和属性泄露更为严重,推荐将其作为隐私泄露的基准测试工具。
English: Voice anonymization should be evaluated using low false-positive rate metrics rather than Equal Error Rate, as demonstrated by the VoxGuard framework that reveals significantly stronger privacy attacks and attribute leakage under this regime.

Authors:Tianyue Wang, Shuang Yang, Shiguang Shan, Xilin Chen
Title: GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition
Abstract:
Visual speech recognition (VSR), also known as lip reading, is the task of recognizing speech from silent video. Despite significant advancements in VSR over recent decades, most existing methods pay limited attention to real-world visual challenges such as illumination variations, occlusions, blurring, and pose changes. To address these challenges, we propose GLip, a Global-Local Integrated Progressive framework designed for robust VSR. GLip is built upon two key insights: (i) learning an initial \textit{coarse} alignment between visual features across varying conditions and corresponding speech content facilitates the subsequent learning of \textit{precise} visual-to-speech mappings in challenging environments; (ii) under adverse conditions, certain local regions (e.g., non-occluded areas) often exhibit more discriminative cues for lip reading than global features. To this end, GLip introduces a dual-path feature extraction architecture that integrates both global and local features within a two-stage progressive learning framework. In the first stage, the model learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data, establishing a coarse yet semantically robust foundation. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context across both spatial and temporal dimensions, refining the coarse representations into precise visual-speech mappings. Our framework uniquely exploits discriminative local regions through a progressive learning strategy, demonstrating enhanced robustness against various visual challenges and consistently outperforming existing methods on the LRS2 and LRS3 benchmarks. We further validate its effectiveness on a newly introduced challenging Mandarin dataset.
中文摘要:提出的GLip框架通过两阶段渐进学习整合全局与局部特征,有效提升视觉语音识别的鲁棒性,在多种视觉干扰条件下优于现有方法,并在多个基准测试中表现卓越。
English Summary: The proposed GLip framework enhances visual speech recognition by progressively integrating global and local features through a two-stage learning process, demonstrating superior robustness against visual challenges and outperforming existing methods on multiple benchmarks.

Authors:Tianyue Wang, Shuang Yang, Shiguang Shan, Xilin Chen
Title: GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition
Abstract:
Visual speech recognition (VSR), also known as lip reading, is the task of recognizing speech from silent video. Despite significant advancements in VSR over recent decades, most existing methods pay limited attention to real-world visual challenges such as illumination variations, occlusions, blurring, and pose changes. To address these challenges, we propose GLip, a Global-Local Integrated Progressive framework designed for robust VSR. GLip is built upon two key insights: (i) learning an initial coarse alignment between visual features across varying conditions and corresponding speech content facilitates the subsequent learning of precise visual-to-speech mappings in challenging environments; (ii) under adverse conditions, certain local regions (e.g., non-occluded areas) often exhibit more discriminative cues for lip reading than global features. To this end, GLip introduces a dual-path feature extraction architecture that integrates both global and local features within a two-stage progressive learning framework. In the first stage, the model learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data, establishing a coarse yet semantically robust foundation. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context across both spatial and temporal dimensions, refining the coarse representations into precise visual-speech mappings. Our framework uniquely exploits discriminative local regions through a progressive learning strategy, demonstrating enhanced robustness against various visual challenges and consistently outperforming existing methods on the LRS2 and LRS3 benchmarks. We further validate its effectiveness on a newly introduced challenging Mandarin dataset.
中文摘要:提出的GLip框架通过两阶段渐进学习整合全局与局部特征,有效提升视觉语音识别的鲁棒性,在多种视觉干扰条件下优于现有方法,并在多个基准测试中表现卓越。
English Summary: The proposed GLip framework enhances visual speech recognition by progressively integrating global and local features through a two-stage learning process, demonstrating superior robustness against visual challenges and outperforming existing methods on multiple benchmarks.

Authors:Frederic Kirstein, Sonu Kumar, Terry Ruas, Bela Gipp
Title: Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions
Abstract:
Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and strongly aligns with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.
中文:FRAME是一个模块化流程,将会议摘要重构为语义丰富任务,通过主题组织事实减少幻觉和遗漏;SCOPE则通过推理协议提升个性化,P-MESA评估框架能可靠衡量摘要质量。
English: FRAME is a modular pipeline that transforms meeting summarization into semantic enrichment, reducing hallucinations and omissions by organizing facts thematically, while SCOPE enhances personalization through a reasoning protocol, with P-MESA providing reliable evaluation of summary quality.

Authors:Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
Title: FlowRL: Matching Reward Distributions for LLM Reasoning
Abstract:
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
Chinese: FlowRL 提出了一种流平衡方法,通过匹配完整奖励分布来增强大型语言模型强化学习中的多样性和推理能力,在数学和代码推理任务中显著优于PPO和GRPO等奖励最大化方法。
English: FlowRL introduces a flow balancing method that matches the full reward distribution in LLM reinforcement learning, enhancing diversity and achieving superior performance over reward-maximizing approaches like PPO and GRPO in math and code reasoning tasks.

Authors:Xuekai Zhu, Daixuan Cheng, Dinghuai Zhang, Hengli Li, Kaiyan Zhang, Che Jiang, Youbang Sun, Ermo Hua, Yuxin Zuo, Xingtai Lv, Qizheng Zhang, Lin Chen, Fanghao Shao, Bo Xue, Yunchong Song, Zhenjie Yang, Ganqu Cui, Ning Ding, Jianfeng Gao, Xiaodong Liu, Bowen Zhou, Hongyuan Mei, Zhouhan Lin
Title: FlowRL: Matching Reward Distributions for LLM Reasoning
Abstract:
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (\eg, PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on math and code reasoning tasks: FlowRL achieves a significant average improvement of $10.0\%$ over GRPO and $5.1\%$ over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution-matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.
Chinese: FlowRL 提出了一种流平衡方法,通过匹配完整奖励分布来增强大型语言模型强化学习中的多样性和推理能力,在数学和代码推理任务中显著优于PPO和GRPO等奖励最大化方法。
English: FlowRL introduces a flow balancing method that matches the full reward distribution in LLM reinforcement learning, enhancing diversity and achieving superior performance over reward-maximizing approaches like PPO and GRPO in math and code reasoning tasks.

Authors:Geon Lee, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Kijung Shin, Neil Shah, Liam Collins
Title: Sequential Data Augmentation for Generative Recommendation
Abstract:
Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation.
中文摘要:数据增强是生成式推荐系统中至关重要但常被忽视的因素,显著影响模型性能,而提出的GenPAS框架通过原则性方法系统控制增强偏差,从而获得更优结果。
English Summary: Data augmentation is a critical but often overlooked factor in generative recommendation that significantly impacts model performance, and the proposed GenPAS framework provides a principled approach to systematically control augmentation biases for superior results.

Authors:Jonas Becker, Lars Benedikt Kaesberg, Niklas Bauer, Jan Philip Wahle, Terry Ruas, Bela Gipp
Title: MALLM: Multi-Agent Large Language Models Framework
Abstract:
Multi-agent debate (MAD) has demonstrated the ability to augment collective intelligence by scaling test-time compute and leveraging expertise. Current frameworks for multi-agent debate are often designed towards tool use, lack integrated evaluation, or provide limited configurability of agent personas, response generators, discussion paradigms, and decision protocols. We introduce MALLM (Multi-Agent Large Language Models), an open-source framework that enables systematic analysis of MAD components. MALLM offers more than 144 unique configurations of MAD, including (1) agent personas (e.g., Expert, Personality), (2) response generators (e.g., Critical, Reasoning), (3) discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g., Voting, Consensus). MALLM uses simple configuration files to define a debate. Furthermore, MALLM can load any textual Hugging Face dataset (e.g., MMLU-Pro, WinoGrande) and provides an evaluation pipeline for easy comparison of MAD configurations. MALLM enables researchers to systematically configure, run, and evaluate debates for their problems, facilitating the understanding of the components and their interplay.
中文: MALLM是一个开源框架,支持通过144多种可配置组件系统性地构建、运行和评估多智能体辩论,便于深入分析集体智能的协同机制。
English: MALLM is an open-source framework that enables systematic configuration, execution, and evaluation of multi-agent debates with over 144 customizable components, facilitating comprehensive analysis of their collective intelligence.

Authors:Qiuyu Chen, Xin Jin, Yue Song, Xihui Liu, Shuai Yang, Tao Yang, Ziqiang Li, Jianguo Huang, Yuntao Wei, Ba'ao Xie, Nicu Sebe, Wenjun, Zeng, Jooyeol Yun, Davide Abati, Mohamed Omran, Jaegul Choo, Amir Habibian, Auke Wiggers, Masato Kobayashi, Ning Ding, Toru Tamaki, Marzieh Gheisari, Auguste Genovesio, Yuheng Chen, Dingkun Liu, Xinyao Yang, Xinping Xu, Baicheng Chen, Dongrui Wu, Junhao Geng, Lexiang Lv, Jianxin Lin, Hanzhe Liang, Jie Zhou, Xuanxin Chen, Jinbao Wang, Can Gao, Zhangyi Wang, Zongze Li, Bihan Wen, Yixin Gao, Xiaohan Pan, Xin Li, Zhibo Chen, Baorui Peng, Zhongming Chen, Haoran Jin
Title: The 1st International Workshop on Disentangled Representation Learning for Controllable Generation (DRL4Real): Methods and Results
Abstract:
This paper reviews the 1st International Workshop on Disentangled Representation Learning for Controllable Generation (DRL4Real), held in conjunction with ICCV 2025. The workshop aimed to bridge the gap between the theoretical promise of Disentangled Representation Learning (DRL) and its application in realistic scenarios, moving beyond synthetic benchmarks. DRL4Real focused on evaluating DRL methods in practical applications such as controllable generation, exploring advancements in model robustness, interpretability, and generalization. The workshop accepted 9 papers covering a broad range of topics, including the integration of novel inductive biases (e.g., language), the application of diffusion models to DRL, 3D-aware disentanglement, and the expansion of DRL into specialized domains like autonomous driving and EEG analysis. This summary details the workshop's objectives, the themes of the accepted papers, and provides an overview of the methodologies proposed by the authors.
中文: DRL4Real研讨会旨在弥合解耦表示学习的理论与现实应用之间的差距,收录的9篇论文涵盖扩散模型和3D解耦等主题,应用于自动驾驶和脑电分析等领域。
English: The DRL4Real workshop at ICCV 2025 bridged theoretical Disentangled Representation Learning with practical applications, featuring 9 papers on topics like diffusion models and 3D-aware disentanglement for domains including autonomous driving and EEG analysis.

Authors:Xinwei Long, Kai Tian, Peng Xu, Guoli Jia, Jingxuan Li, Sa Yang, Yihua Shao, Kaiyan Zhang, Che Jiang, Hao Xu, Yang Liu, Jiaheng Ma, Bowen Zhou
Title: AdsQA: Towards Advertisement Video Understanding
Abstract:
Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming boost these general-purpose models to continuously evolve via learning deeper expertise. Now is thus the time further to extend the diversity of specialized applications for knowledgeable LLMs, though collecting high quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense ad videos' traits, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 22.7 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a Deepseek-R1 styled RL model that reflects on questions, and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our \texttt{ReAd-R}~achieves the state-of-the-art outperforming strong competitors equipped with long-chain reasoning capabilities by a clear margin.
中文: 本文提出AdsQA这一基于广告视频的新基准,用于评估大语言模型对复杂营销要素的感知能力,并开发了ReAd-R强化学习模型,该模型在推理任务中显著超越其他顶尖模型,实现了最先进的性能。
English: This paper introduces AdsQA, a novel benchmark using advertisement videos to evaluate large language models' ability to perceive complex marketing elements, and proposes ReAd-R, a reinforcement learning model that achieves state-of-the-art performance by outperforming other top models in reasoning tasks.

Authors:Yuheng Jiang, Chengcheng Guo, Yize Wu, Yu Hong, Shengkun Zhu, Zhehao Shen, Yingliang Zhang, Shaohui Jiao, Zhuo Su, Lan Xu, Marc Habermann, Christian Theobalt
Title: Topology-Aware Optimization of Gaussian Primitives for Human-Centric Volumetric Videos
Abstract:
Volumetric video is emerging as a key medium for digitizing the dynamic physical world, creating the virtual environments with six degrees of freedom to deliver immersive user experiences. However, robustly modeling general dynamic scenes, especially those involving topological changes while maintaining long-term tracking remains a fundamental challenge. In this paper, we present TaoGS, a novel topology-aware dynamic Gaussian representation that disentangles motion and appearance to support, both, long-range tracking and topological adaptation. We represent scene motion with a sparse set of motion Gaussians, which are continuously updated by a spatio-temporal tracker and photometric cues that detect structural variations across frames. To capture fine-grained texture, each motion Gaussian anchors and dynamically activates a set of local appearance Gaussians, which are non-rigidly warped to the current frame to provide strong initialization and significantly reduce training time. This activation mechanism enables efficient modeling of detailed textures and maintains temporal coherence, allowing high-fidelity rendering even under challenging scenarios such as changing clothes. To enable seamless integration into codec-based volumetric formats, we introduce a global Gaussian Lookup Table that records the lifespan of each Gaussian and organizes attributes into a lifespan-aware 2D layout. This structure aligns naturally with standard video codecs and supports up to 40 compression. TaoGS provides a unified, adaptive solution for scalable volumetric video under topological variation, capturing moments where "elegance in motion" and "Power in Stillness", delivering immersive experiences that harmonize with the physical world.
中文: 体视频在拓扑变化动态场景建模中存在挑战,而TaoGS提出了一种拓扑感知的高斯表示法,通过解耦运动与外观实现稳定追踪和逼真渲染,支持高效压缩与沉浸式体验。
English: Volumetric video faces challenges in modeling dynamic scenes with topological changes, but TaoGS introduces a topology-aware Gaussian representation that disentangles motion and appearance for robust tracking and high-fidelity rendering.

Authors:Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou
Title: Towards a Unified View of Large Language Model Post-Training
Abstract:
Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstration and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
中文: 本文提出了一个统一的理论框架,通过统一策略梯度估计器将在线和离线数据整合用于语言模型后训练,并提出了混合后训练(HPT)方法动态优化训练信号,在多个基准测试中验证了其优越性能。
English: This paper introduces a unified theoretical framework that integrates online and offline data for post-training language models through a Unified Policy Gradient Estimator, proposing Hybrid Post-Training (HPT) to dynamically optimize training signals and demonstrating its superior performance across multiple benchmarks.

Authors:Daocheng Fu, Jianbiao Mei, Licheng Wen, Xuemeng Yang, Cheng Yang, Rong Wu, Tao Hu, Siqi Li, Yufan Shen, Xinyu Cai, Pinlong Cai, Botian Shi, Yong Liu, Yu Qiao
Title: RE-Searcher: Robust Agentic Search with Goal-oriented Planning and Self-reflection
Abstract:
Large language models (LLMs) excel at knowledge-intensive question answering and reasoning, yet their real-world deployment remains constrained by knowledge cutoff, hallucination, and limited interaction modalities. Augmenting LLMs with external search tools helps alleviate these issues, but it also exposes agents to a complex search environment in which small, plausible variations in query formulation can steer reasoning into unproductive trajectories and amplify errors. We present a systematic analysis that quantifies how environmental complexity induces fragile search behaviors and, in turn, degrades overall performance. To address this challenge, we propose a simple yet effective approach to instantiate a search agent, RE-Searcher. During search, RE-Searcher explicitly articulates a concrete search goal and subsequently reflects on whether the retrieved evidence satisfies that goal. This combination of goal-oriented planning and self-reflection enables RE-Searcher to resist spurious cues in complex search environments and perform robust search. Extensive experiments show that our method improves search accuracy and achieves state-of-the-art results. Perturbation studies further demonstrate substantial resilience to noisy or misleading external signals, mitigating the fragility of the search process. We believe these findings offer practical guidance for integrating LLM-powered agents into more complex interactive environments and enabling more autonomous decision-making.
中文摘要:大型语言模型在复杂搜索环境中存在脆弱性,而RE-Searcher方法通过目标导向规划和自我反思相结合,有效提升了搜索准确性和抗干扰能力,实现了更稳健的搜索表现。
English Summary: Large language models face challenges in complex search environments, but the proposed RE-Searcher method improves robustness by combining goal-oriented planning with self-reflection to enhance search accuracy and resilience against noise.

Authors:Jiayi Kuang, Haojing Huang, Yinghui Li, Xinnian Liang, Zhikun Xu, Yangning Li, Xiaoyu Tan, Chao Qu, Meishan Zhang, Ying Shen, Philip S. Yu
Title: Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities
Abstract:
Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of "atomic thinking".
中文摘要:大型语言模型在数学原子能力上的评估揭示了其真实推理与记忆的区别,为更有效的训练策略提供了指导,涵盖多个数学领域和逻辑层次。
English Summary: Large Language Models are evaluated on their mathematical atomic capabilities across various fields and logical levels, revealing insights into their true reasoning versus memorization and guiding more effective training strategies.

Authors:Zekai Zhang, Mingwei Liu, Zhenxi Chen, Linxi Liang, Yuxuan Chen, Guangsheng Ou, Yanlin Wang, Dan Li, Xin Peng, Zibin Zheng
Title: Generating High-Quality Datasets for Code Editing via Open-Source Language Models
Abstract:
Code editing plays a vital role in software engineering, requiring developers to adjust existing code according to natural language instructions while keeping functionality intact and avoiding unnecessary modifications. However, commit-based datasets commonly used for this task are often noisy, lack diversity, and fail to reflect the style of real-world edit instructions. To address this, we introduce OpenCodeEdit, an open-source pipeline that leverages multiple LLMs to synthesize realistic code-edit triplets. The pipeline produces both concise "lazy" instructions and more detailed "descriptive" ones, and applies filtering based on diffs and topics to guarantee data quality and variety. Using this process, we construct OCEDataFT, a curated dataset of 20K samples. Fine-tuning three advanced base models on OCEDataFT leads to significant performance boosts on the CanItEdit benchmark, with relative pass@1 improvements ranging from 4.50% to 20.79%. Notably, the resulting models achieve performance close to closed-source systems, narrowing the gap to GPT-4 to just 3.54%, without relying on proprietary resources or manual annotation.
中文: OpenCodeEdit提出了一种开源流程,通过合成真实的代码编辑三元组构建了OCEDataFT数据集,该数据集显著提升了模型在基准测试中的表现,无需依赖专有资源即可接近GPT-4等闭源系统的性能。
English: OpenCodeEdit introduces an open-source pipeline that synthesizes realistic code-edit triplets, creating the OCEDataFT dataset which significantly boosts model performance on benchmarks, nearly matching closed-source systems like GPT-4 without proprietary resources.

Authors:Jinming Liu, Zhaoyang Jia, Jiahao Li, Bin Li, Xin Jin, Wenjun Zeng, Yan Lu
Title: When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs
Abstract:
The increasing deployment of powerful Multimodal Large Language Models (MLLMs), typically hosted on cloud platforms, urgently requires effective compression techniques to efficiently transmit signal inputs (e.g., images, videos) from edge devices with minimal bandwidth usage. However, conventional image codecs are optimized for fidelity to serve the Human Visual System (HVS) and ill-suited for MLLMs, in which diverse downstream tasks are jointly considered. In this paper, we first systematically analyze the impact of compression artifacts on several mainstream MLLMs. We find that: Compression distortion unevenly impacts different-level image features, leading to varying effects on MLLMs' downstream tasks depending on their feature-level reliance. Motivated by this discovery, we propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. The encoder leverages CLIP's shallow-layer attention to generate an importance map for bit allocation, preserving critical semantic regions. Concurrently, the decoder integrates a lightweight adapter with a multi-level loss function to ensure the faithful reconstruction both of low-level details and high-level semantic context for robust synthesis of cross-level features. Extensive experiments validate that our method achieves up to 35.99\% bitrate saving while maintaining the same performance on the MLLM tasks, outperforming previous SOTA neural codecs.
中文: 该摘要提出了一种专为多模态大语言模型设计的图像编解码器CoTAM,它能自适应保护多层级特征,在保持任务性能的同时显著节省比特率,性能优于现有神经编解码器。
English: The abstract introduces CoTAM, an image codec designed for Multimodal Large Language Models that adaptively preserves multi-level features and achieves significant bitrate savings while maintaining task performance, outperforming existing neural codecs.

Authors:Chenyu Zhou, Xiaoming Shi, Hui Qiu, Xiawu Zheng, Haitao Leng, Yankai Jiang, Shaoguo Liu, Tingting Gao, Rongrong Ji
Title: Mix-Ecom: Towards Mixed-Type E-Commerce Dialogues with Complex Domain Rules
Abstract:
E-commerce agents contribute greatly to helping users complete their e-commerce needs. To promote further research and application of e-commerce agents, benchmarking frameworks are introduced for evaluating LLM agents in the e-commerce domain. Despite the progress, current benchmarks lack evaluating agents' capability to handle mixed-type e-commerce dialogue and complex domain rules. To address the issue, this work first introduces a novel corpus, termed Mix-ECom, which is constructed based on real-world customer-service dialogues with post-processing to remove user privacy and add CoT process. Specifically, Mix-ECom contains 4,799 samples with multiply dialogue types in each e-commerce dialogue, covering four dialogue types (QA, recommendation, task-oriented dialogue, and chit-chat), three e-commerce task types (pre-sales, logistics, after-sales), and 82 e-commerce rules. Furthermore, this work build baselines on Mix-Ecom and propose a dynamic framework to further improve the performance. Results show that current e-commerce agents lack sufficient capabilities to handle e-commerce dialogues, due to the hallucination cased by complex domain rules. The dataset will be publicly available.
中文: 本文提出了Mix-ECom数据集,基于真实电商对话构建,用于评估大语言模型代理处理混合对话类型和复杂领域规则的能力,揭示了当前模型在应对电商交互方面的不足。
English: This paper introduces Mix-ECom, a novel dataset derived from real-world e-commerce dialogues, to benchmark LLM agents' ability to handle mixed dialogue types and complex domain rules, revealing current limitations in managing e-commerce interactions.

Authors:Yue Liu, Yanjie Zhao, Yunbo Lyu, Ting Zhang, Haoyu Wang, David Lo
Title: "Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors
Abstract:
Agentic AI coding editors driven by large language models have recently become more popular due to their ability to improve developer productivity during software development. Modern editors such as Cursor are designed not just for code completion, but also with more system privileges for complex coding tasks (e.g., run commands in the terminal, access development environments, and interact with external systems). While this brings us closer to the "fully automated programming" dream, it also raises new security concerns. In this study, we present the first empirical analysis of prompt injection attacks targeting these high-privilege agentic AI coding editors. We show how attackers can remotely exploit these systems by poisoning external development resources with malicious instructions, effectively hijacking AI agents to run malicious commands, turning "your AI" into "attacker's shell". To perform this analysis, we implement AIShellJack, an automated testing framework for assessing prompt injection vulnerabilities in agentic AI coding editors. AIShellJack contains 314 unique attack payloads that cover 70 techniques from the MITRE ATT&CK framework. Using AIShellJack, we conduct a large-scale evaluation on GitHub Copilot and Cursor, and our evaluation results show that attack success rates can reach as high as 84% for executing malicious commands. Moreover, these attacks are proven effective across a wide range of objectives, ranging from initial access and system discovery to credential theft and data exfiltration.
中文: 基于大语言模型的智能AI编程编辑器虽然提升了开发效率,但存在提示注入攻击风险,攻击者可远程操控系统执行恶意指令,成功率高达84%。
English: Agentic AI coding editors, while enhancing developer productivity, are vulnerable to prompt injection attacks that can hijack the systems to execute malicious commands with high success rates.

Authors:Tianshuo Zhang, Li Gao, Siran Peng, Xiangyu Zhu, Zhen Lei
Title: DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces
Abstract:
The rise of realistic digital face generation and manipulation poses significant social risks. The primary challenge lies in the rapid and diverse evolution of generation techniques, which often outstrip the detection capabilities of existing models. To defend against the ever-evolving new types of forgery, we need to enable our model to quickly adapt to new domains with limited computation and data while avoiding forgetting previously learned forgery types. In this work, we posit that genuine facial samples are abundant and relatively stable in acquisition methods, while forgery faces continuously evolve with the iteration of manipulation techniques. Given the practical infeasibility of exhaustively collecting all forgery variants, we frame face forgery detection as a continual learning problem and allow the model to develop as new forgery types emerge. Specifically, we employ a Developmental Mixture of Experts (MoE) architecture that uses LoRA models as its individual experts. These experts are organized into two groups: a Real-LoRA to learn and refine knowledge of real faces, and multiple Fake-LoRAs to capture incremental information from different forgery types. To prevent catastrophic forgetting, we ensure that the learning direction of Fake-LoRAs is orthogonal to the established subspace. Moreover, we integrate orthogonal gradients into the orthogonal loss of Fake-LoRAs, preventing gradient interference throughout the training process of each task. Experimental results under both the datasets and manipulation types incremental protocols demonstrate the effectiveness of our method.
中文摘要:本文提出了一种持续学习方法,采用发展型专家混合架构和专用LoRA模块,使伪造人脸检测模型能够适应新型篡改技术,同时保留对已知伪造类型的识别能力。
English Summary: This paper proposes a continual learning approach using a Developmental Mixture of Experts with specialized LoRA modules to enable face forgery detection models to adapt to new manipulation techniques while preserving knowledge of previous forgery types.

Authors:Keliang Liu, Dingkang Yang, Ziyun Qian, Weijie Yin, Yuchi Wang, Hongsheng Li, Jun Liu, Peng Zhai, Yang Liu, Lihua Zhang
Title: Reinforcement Learning Meets Large Language Models: A Survey of Advancements and Applications Across the LLM Lifecycle
Abstract:
In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and bolstering inferential strength. Although existing surveys offer overviews of RL augmented LLMs, their scope is often limited, failing to provide a comprehensive summary of how RL operates across the full lifecycle of LLMs. We systematically review the theoretical and practical advancements whereby RL empowers LLMs, especially Reinforcement Learning with Verifiable Rewards (RLVR). First, we briefly introduce the basic theory of RL. Second, we thoroughly detail application strategies for RL across various phases of the LLM lifecycle, including pre-training, alignment fine-tuning, and reinforced reasoning. In particular, we emphasize that RL methods in the reinforced reasoning phase serve as a pivotal driving force for advancing model reasoning to its limits. Next, we collate existing datasets and evaluation benchmarks currently used for RL fine-tuning, spanning human-annotated datasets, AI-assisted preference data, and program-verification-style corpora. Subsequently, we review the mainstream open-source tools and training frameworks available, providing clear practical references for subsequent research. Finally, we analyse the future challenges and trends in the field of RL-enhanced LLMs. This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs, with the goal of fostering the evolution of LLMs that are more intelligent, generalizable, and secure.
中文摘要:强化学习显著提升了大语言模型的推理与对齐能力,本文系统综述了强化学习在LLM全生命周期中的作用,并分析了该领域的未来挑战与发展趋势。
English Summary: Reinforcement Learning significantly improves Large Language Models' reasoning and alignment capabilities, and this survey comprehensively reviews RL's role across the entire LLM lifecycle while analyzing future challenges and trends.

Authors:Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
Title: Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Abstract:
Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
中文摘要:本实证研究表明,扩散模型LLaDA作为Whisper-LLaMA的审议模块可显著降低语音识别的词错率,同时作为独立解码器在保持较快推理速度方面展现出潜力。
English Summary: This empirical study demonstrates that the diffusion-based LLaDA model significantly reduces word error rates when used as a deliberation module with Whisper-LLaMA for speech recognition, while also showing potential for faster inference as a standalone ASR decoder.

Authors:Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
Title: Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
Abstract:
Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
中文摘要:本实证研究表明,扩散模型LLaDA作为Whisper-LLaMA的审议模块可显著降低语音识别的词错率,同时作为独立解码器在保持较快推理速度方面展现出潜力。
English Summary: This empirical study demonstrates that the diffusion-based LLaDA model significantly reduces word error rates when used as a deliberation module with Whisper-LLaMA for speech recognition, while also showing potential for faster inference as a standalone ASR decoder.

Authors:An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu
Title: 3D Aware Region Prompted Vision Language Model
Abstract:
We present Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space on scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.
中文: SR-3D是一种通过共享视觉标记空间连接二维图像与三维数据的视觉语言模型,支持灵活的区域标注方式,无需多帧标注即可实现跨帧空间推理,并在各类基准测试中达到最优性能。
English: SR-3D is a vision-language model that integrates 2D images and 3D data through a shared token space, enabling flexible region annotation and achieving state-of-the-art performance in spatial reasoning without requiring multi-frame labeling.

Authors:Maciej Besta, Shriram Chandran, Robert Gerstenberger, Mathis Lindner, Marcin Chrapek, Sebastian Hermann Martschat, Taraneh Ghandi, Patrick Iff, Hubert Niewiadomski, Piotr Nyczyk, Jürgen Müller, Torsten Hoefler
Title: Psychologically Enhanced AI Agents
Abstract:
We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method primes agents with distinct personality archetypes via prompt engineering, enabling control over behavior along two foundational axes of human psychology, cognition and affect. We show that such personality priming yields consistent, interpretable behavioral biases across diverse tasks: emotionally expressive agents excel in narrative generation, while analytically primed agents adopt more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning quality. To ensure trait persistence, we integrate the official 16Personalities test for automated verification. While our focus is on MBTI, we show that our approach generalizes seamlessly to other psychological frameworks such as Big Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.
中文: MBTI-in-Thoughts框架通过基于MBTI的人格提示工程增强大语言模型代理,使其在叙事生成和策略推理等任务中表现出稳定且可解释的行为模式,无需微调即可实现心理理论驱动的AI行为设计。
English: The MBTI-in-Thoughts framework enhances LLM agents by integrating MBTI personality conditioning through prompt engineering, enabling consistent behavioral biases and improved performance in tasks like narrative generation and strategic reasoning without fine-tuning.

Authors:Wenjie Zhu, Yabin Zhang, Xin Jin, Wenjun Zeng, Lei Zhang
Title: ANTS: Shaping the Adaptive Negative Textual Space by MLLM for OOD Detection
Abstract:
The introduction of negative labels (NLs) has proven effective in enhancing Out-of-Distribution (OOD) detection. However, existing methods often lack an understanding of OOD images, making it difficult to construct an accurate negative space. In addition, the presence of false negative labels significantly degrades their near-OOD performance. To address these issues, we propose shaping an Adaptive Negative Textual Space (ANTS) by leveraging the understanding and reasoning capabilities of multimodal large language models (MLLMs). Specifically, we identify images likely to be OOD samples as negative images and prompt the MLLM to describe these images, generating expressive negative sentences that precisely characterize the OOD distribution and enhance far-OOD detection. For the near-OOD setting, where OOD samples resemble the in-distribution (ID) subset, we first identify the subset of ID classes that are visually similar to negative images and then leverage the reasoning capability of MLLMs to generate visually similar negative labels tailored to this subset, effectively reducing false negatives and improving near-OOD detection. To balance these two types of negative textual spaces, we design an adaptive weighted score that enables the method to handle different OOD task settings (near-OOD and far-OOD) without relying on task-specific prior knowledge, making it highly adaptable in open environments. On the ImageNet benchmark, our ANTS significantly reduces the FPR95 by 4.2\%, establishing a new state-of-the-art. Furthermore, our method is training-free and zero-shot, enabling high scalability.
中文: 提出的自适应负向文本空间方法利用多模态大语言模型生成精确的负向标签,有效提升远近分布外检测性能,在ImageNet基准上实现最优结果且无需训练、扩展性强。
English: The proposed Adaptive Negative Textual Space (ANTS) method leverages multimodal large language models to generate precise negative labels for both far and near out-of-distribution detection, achieving state-of-the-art performance on ImageNet while being training-free and highly scalable.

Authors:Miao Xu, Xiangyu Zhu, Xusheng Liang, Zidu Wang, Jinlin Wu, Zhen Lei
Title: Towards Realistic Hand-Object Interaction with Gravity-Field Based Diffusion Bridge
Abstract:
Existing reconstruction or hand-object pose estimation methods are capable of producing coarse interaction states. However, due to the complex and diverse geometry of both human hands and objects, these approaches often suffer from interpenetration or leave noticeable gaps in regions that are supposed to be in contact. Moreover, the surface of a real human hand undergoes non-negligible deformations during interaction, which are difficult to capture and represent with previous methods. To tackle these challenges, we formulate hand-object interaction as an attraction-driven process and propose a Gravity-Field Based Diffusion Bridge (GravityDB) to simulate interactions between a deformable hand surface and rigid objects. Our approach effectively resolves the aforementioned issues by generating physically plausible interactions that are free of interpenetration, ensure stable grasping, and capture realistic hand deformations. Furthermore, we incorporate semantic information from textual descriptions to guide the construction of the gravitational field, enabling more semantically meaningful interaction regions. Extensive qualitative and quantitative experiments on multiple datasets demonstrate the effectiveness of our method.
中文: 提出的基于引力场的扩散桥方法通过模拟吸引驱动过程并融入语义指导,有效解决了手物交互中的穿透、接触间隙和手部变形问题,实现了物理合理的交互效果。
English: The proposed Gravity-Field Based Diffusion Bridge (GravityDB) method addresses interpenetration, contact gaps, and hand deformation issues in hand-object interaction by simulating attraction-driven processes and incorporating semantic guidance for physically plausible results.

Authors:Lingkai Meng, Long Yuan, Xuemin Lin, Wenjie Zhang, Ying Zhang
Title: Triangle Counting in Hypergraph Streams: A Complete and Practical Approach
Abstract:
Triangle counting in hypergraph streams, including both hyper-vertex and hyper-edge triangles, is a fundamental problem in hypergraph analytics, with broad applications. However, existing methods face two key limitations: (i) an incomplete classification of hyper-vertex triangle structures, typically considering only inner or outer triangles; and (ii) inflexible sampling schemes that predefine the number of sampled hyperedges, which is impractical under strict memory constraints due to highly variable hyperedge sizes. To address these challenges, we first introduce a complete classification of hyper-vertex triangles, including inner, hybrid, and outer triangles. Based on this, we develop HTCount, a reservoir-based algorithm that dynamically adjusts the sample size based on the available memory M. To further improve memory utilization and reduce estimation error, we develop HTCount-P, a partition-based variant that adaptively partitions unused memory into independent sample subsets. We provide theoretical analysis of the unbiasedness and variance bounds of the proposed algorithms. Case studies demonstrate the expressiveness of our triangle structures in revealing meaningful interaction patterns. Extensive experiments on real-world hypergraphs show that both our algorithms achieve highly accurate triangle count estimates under strict memory constraints, with relative errors that are 1 to 2 orders of magnitude lower than those of existing methods and consistently high throughput.
中文: 本研究提出了超顶点三角形的完整分类,并开发了在内存限制下动态调整采样的HTCount和HTCount-P算法,相比现有方法实现了显著更高的精确度和吞吐量。
English: This study introduces a complete classification of hyper-vertex triangles and develops HTCount and HTCount-P algorithms that dynamically adjust sampling under memory constraints, achieving significantly higher accuracy and throughput than existing methods.

Authors:Miao Rang, Zhenni Bi, Hang Zhou, Hanting Chen, An Xiao, Tianyu Guo, Kai Han, Xinghao Chen, Yunhe Wang
Title: Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation
Abstract:
The rapid advancement of large language models (LLMs) has significantly advanced the capabilities of artificial intelligence across various domains. However, their massive scale and high computational costs render them unsuitable for direct deployment in resource-constrained edge environments. This creates a critical need for high-performance small models that can operate efficiently at the edge. Yet, after pre-training alone, these smaller models often fail to meet the performance requirements of complex tasks. To bridge this gap, we introduce a systematic post-training pipeline that efficiently enhances small model accuracy. Our post training pipeline consists of curriculum-based supervised fine-tuning (SFT) and offline on-policy knowledge distillation. The resulting instruction-tuned model achieves state-of-the-art performance among billion-parameter models, demonstrating strong generalization under strict hardware constraints while maintaining competitive accuracy across a variety of tasks. This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
中文: 本文提出一种系统化的后训练流程,通过课程监督微调和离线策略知识蒸馏,使小型语言模型在边缘设备上实现十亿参数模型中的最优性能,为昇腾边缘设备提供了高效解决方案。
English: This paper introduces a systematic post-training pipeline that enhances small language models' performance for edge deployment, achieving state-of-the-art results among billion-parameter models through curriculum-based fine-tuning and knowledge distillation.

Authors:Xinbei Ma, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Mengru Wang, Jen-tse Huang, Qu Yang, Wenxuan Wang, Fanghua Ye, Qingxuan Jiang, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Hai Zhao, Zhaopeng Tu, Xiaolong Li, Linus
Title: The Hunger Game Debate: On the Emergence of Over-Competition in Multi-Agent Systems
Abstract:
LLM-based multi-agent systems demonstrate great potential for tackling complex problems, but how competition shapes their behavior remains underexplored. This paper investigates the over-competition in multi-agent debate, where agents under extreme pressure exhibit unreliable, harmful behaviors that undermine both collaboration and task performance. To study this phenomenon, we propose HATE, the Hunger Game Debate, a novel experimental framework that simulates debates under a zero-sum competition arena. Our experiments, conducted across a range of LLMs and tasks, reveal that competitive pressure significantly stimulates over-competition behaviors and degrades task performance, causing discussions to derail. We further explore the impact of environmental feedback by adding variants of judges, indicating that objective, task-focused feedback effectively mitigates the over-competition behaviors. We also probe the post-hoc kindness of LLMs and form a leaderboard to characterize top LLMs, providing insights for understanding and governing the emergent social dynamics of AI community.
中文: 本文通过HATE实验框架揭示多智能体在零和博弈辩论中会产生损害任务表现的过度竞争行为,并证明引入客观评判机制能有效缓解此类负面行为。
English: This paper introduces the HATE framework to study how zero-sum competition in multi-agent debates triggers harmful over-competition behaviors that degrade performance, and finds that objective feedback from judges can effectively mitigate these negative effects.

Authors:Gongxu Luo, Loka Li, Guangyi Chen, Haoyue Dai, Kun Zhang
Title: Characterization and Learning of Causal Graphs with Latent Confounders and Post-treatment Selection from Interventional Data
Abstract:
Interventional causal discovery seeks to identify causal relations by leveraging distributional changes introduced by interventions, even in the presence of latent confounders. Beyond the spurious dependencies induced by latent confounders, we highlight a common yet often overlooked challenge in the problem due to post-treatment selection, in which samples are selectively included in datasets after interventions. This fundamental challenge widely exists in biological studies; for example, in gene expression analysis, both observational and interventional samples are retained only if they meet quality control criteria (e.g., highly active cells). Neglecting post-treatment selection may introduce spurious dependencies and distributional changes under interventions, which can mimic causal responses, thereby distorting causal discovery results and challenging existing causal formulations. To address this, we introduce a novel causal formulation that explicitly models post-treatment selection and reveals how its differential reactions to interventions can distinguish causal relations from selection patterns, allowing us to go beyond traditional equivalence classes toward the underlying true causal structure. We then characterize its Markov properties and propose a Fine-grained Interventional equivalence class, named FI-Markov equivalence, represented by a new graphical diagram, F-PAG. Finally, we develop a provably sound and complete algorithm, F-FCI, to identify causal relations, latent confounders, and post-treatment selection up to $\mathcal{FI}$-Markov equivalence, using both observational and interventional data. Experimental results on synthetic and real-world datasets demonstrate that our method recovers causal relations despite the presence of both selection and latent confounders.
中文: 本文针对干预性因果发现中的处理后选择问题,提出了一种新框架和算法,能够区分真实因果关系与选择诱导模式,从而在存在潜在混杂因素的情况下准确恢复因果结构。
English: This paper addresses the challenge of post-treatment selection in interventional causal discovery, introducing a novel formulation and algorithm that distinguish true causal relations from selection-induced patterns to accurately recover causal structures despite latent confounders.

Authors:Kaizhen Zhu, Mokai Pan, Zhechuan Yu, Jingya Wang, Jingyi Yu, Ye Shi
Title: Diffusion Bridge or Flow Matching? A Unifying Framework and Comparative Analysis
Abstract:
Diffusion Bridge and Flow Matching have both demonstrated compelling empirical performance in transformation between arbitrary distributions. However, there remains confusion about which approach is generally preferable, and the substantial discrepancies in their modeling assumptions and practical implementations have hindered a unified theoretical account of their relative merits. We have, for the first time, provided a unified theoretical and experimental validation of these two models. We recast their frameworks through the lens of Stochastic Optimal Control and prove that the cost function of the Diffusion Bridge is lower, guiding the system toward more stable and natural trajectories. Simultaneously, from the perspective of Optimal Transport, interpolation coefficients $t$ and $1-t$ of Flow Matching become increasingly ineffective when the training data size is reduced. To corroborate these theoretical claims, we propose a novel, powerful architecture for Diffusion Bridge built on a latent Transformer, and implement a Flow Matching model with the same structure to enable a fair performance comparison in various experiments. Comprehensive experiments are conducted across Image Inpainting, Super-Resolution, Deblurring, Denoising, Translation, and Style Transfer tasks, systematically varying both the distributional discrepancy (different difficulty) and the training data size. Extensive empirical results align perfectly with our theoretical predictions and allow us to delineate the respective advantages and disadvantages of these two models. Our code is available at https://anonymous.4open.science/r/DBFM-3E8E/.
Chinese: 本研究首次对扩散桥和流匹配模型进行了统一的理论与实验验证,通过随机最优控制证明扩散桥具有更低成本函数以实现更稳定轨迹,同时基于最优传输理论揭示流匹配在训练数据减少时效果受限的缺陷。
English: This study provides the first unified theoretical and experimental validation comparing Diffusion Bridge and Flow Matching, demonstrating through Stochastic Optimal Control that Diffusion Bridge achieves lower cost functions for more stable trajectories while revealing Flow Matching's limitations with reduced training data through Optimal Transport analysis.

Authors:Yibo Yan, Guangwei Xu, Xin Zou, Shuliang Liu, James Kwok, Xuming Hu
Title: DocPruner: A Storage-Efficient Framework for Multi-Vector Visual Document Retrieval via Adaptive Patch-Level Embedding Pruning
Abstract:
Visual Document Retrieval (VDR), the task of retrieving visually-rich document pages using queries that combine visual and textual cues, is crucial for numerous real-world applications. Recent state-of-the-art methods leverage Large Vision-Language Models (LVLMs) in a multi-vector paradigm, representing each document as patch-level embeddings to capture fine-grained details. While highly effective, this approach introduces a critical challenge: prohibitive storage overhead, as storing hundreds of vectors per page makes large-scale deployment costly and impractical. To address this, we introduce DocPruner, the first framework to employ adaptive patch-level embedding pruning for VDR to effectively reduce the storage overhead. DocPruner leverages the intra-document patch attention distribution to dynamically identify and discard redundant embeddings for each document. This adaptive mechanism enables a significant 50-60% reduction in storage for leading multi-vector VDR models with negligible degradation in document retrieval performance. Extensive experiments across more than ten representative datasets validate that DocPruner offers a robust, flexible, and effective solution for building storage-efficient, large-scale VDR systems.
Chinese: DocPruner首次在视觉文档检索中采用自适应块级嵌入剪枝,通过基于文档内部注意力的动态冗余嵌入剔除,在检索性能几乎无损的情况下将存储开销降低50-60%。
English: DocPruner introduces adaptive patch-level embedding pruning for Visual Document Retrieval, reducing storage by 50-60% with minimal performance loss by dynamically eliminating redundant embeddings based on intra-document attention.

Authors:Jiajin Tang, Zhengxuan Wei, Yuchen Zhu, Cheng Shi, Guanbin Li, Liang Lin, Sibei Yang
Title: Sim-DETR: Unlock DETR for Temporal Sentence Grounding
Abstract:
Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research.
中文摘要:针对标准DETR在时序语句定位中存在的查询冲突问题,Sim-DETR通过约束自注意力机制和添加查询-帧对齐模块,有效提升了模型性能。
English Summary: Temporal sentence grounding with standard DETR suffers from query conflicts, which Sim-DETR resolves through constrained self-attention and query-to-frame alignment to significantly improve performance.

Authors:Zeren Xiong, Yue Yu, Zedong Zhang, Shuo Chen, Jian Yang, Jun Li
Title: VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis
Abstract:
Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose Visual Mixing Diffusion (VMDiff), a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both noise and latent levels. Our approach comprises: (1) a hybrid sampling process that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an efficient adaptive adjustment module, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
中文摘要:本文提出视觉混合扩散(VMDiff)框架,通过混合采样和自适应参数调整,有效解决多源图像融合中的共存生成与语义偏差问题,在视觉质量、语义一致性和创意评价上优于现有方法。
English Summary: The paper introduces Visual Mixing Diffusion (VMDiff), a diffusion-based framework that effectively integrates multiple image sources to create coherent objects by addressing coexistent and bias generation challenges through hybrid sampling and adaptive parameter adjustment.

Authors:Wenxuan Fang, Jiangwei Weng, Jianjun Qian, Jian Yang, Jun Li
Title: WeatherCycle: Unpaired Multi-Weather Restoration via Color Space Decoupled Cycle Learning
Abstract:
Unsupervised image restoration under multi-weather conditions remains a fundamental yet underexplored challenge. While existing methods often rely on task-specific physical priors, their narrow focus limits scalability and generalization to diverse real-world weather scenarios. In this work, we propose \textbf{WeatherCycle}, a unified unpaired framework that reformulates weather restoration as a bidirectional degradation-content translation cycle, guided by degradation-aware curriculum regularization. At its core, WeatherCycle employs a \textit{lumina-chroma decomposition} strategy to decouple degradation from content without modeling complex weather, enabling domain conversion between degraded and clean images. To model diverse and complex degradations, we propose a \textit{Lumina Degradation Guidance Module} (LDGM), which learns luminance degradation priors from a degraded image pool and injects them into clean images via frequency-domain amplitude modulation, enabling controllable and realistic degradation modeling. Additionally, we incorporate a \textit{Difficulty-Aware Contrastive Regularization (DACR)} module that identifies hard samples via a CLIP-based classifier and enforces contrastive alignment between hard samples and restored features to enhance semantic consistency and robustness. Extensive experiments across serve multi-weather datasets, demonstrate that our method achieves state-of-the-art performance among unsupervised approaches, with strong generalization to complex weather degradations.
中文:WeatherCycle是一个统一的非配对框架,通过双向退化-内容转换和退化感知课程正则化解决多天气图像恢复问题,在无监督方法中实现了最先进的性能。
English: WeatherCycle is a unified unpaired framework that addresses multi-weather image restoration through bidirectional degradation-content translation and degradation-aware curriculum regularization, achieving state-of-the-art performance in unsupervised approaches.

Authors:Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, An Zhang
Title: Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Abstract:
Large language models face challenges in long-context question answering, where key evidence of a query may be dispersed across millions of tokens. Existing works equip large language models with a memory corpus that is dynamically updated during a single-pass document scan, also known as the "memorize while reading" methods. While this approach scales efficiently, it suffers from irreversible forward-only processing, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, a memory-augmented agent with callback-enhanced memory that allows selective retrieval from the entire memory history and allows non-linear reasoning and revisiting of early evidence. To further strengthen training, we propose Reinforcement Learning with Multi-Level Rewards (RLMLR), which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support multi-hop memory utilizing. Experiments on long-document QA show significant gains over existing memory-based approaches, which validates ReMemR1 as an effective solution for long-context reasoning agents.
中文: 大语言模型在长上下文问答中面临证据分散的挑战,而ReMemR1通过回调增强记忆和多层次强化学习实现了选择性检索和优化推理,显著提升了性能。
English: Large language models struggle with long-context question answering due to dispersed evidence, but ReMemR1 introduces callback-enhanced memory and multi-level reinforcement learning to enable selective retrieval and improve reasoning, achieving significant gains in performance.

Authors:Junkang Wu, Kexin Huang, Jiancan Wu, An Zhang, Xiang Wang, Xiangnan He
Title: Quantile Advantage Estimation for Entropy-Safe Reasoning
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) strengthens LLM reasoning, but training often oscillates between {entropy collapse} and {entropy explosion}. We trace both hazards to the mean baseline used in value-free RL (e.g., GRPO and DAPO), which improperly penalizes negative-advantage samples under reward outliers. We propose {Quantile Advantage Estimation} (QAE), replacing the mean with a group-wise K-quantile baseline. QAE induces a response-level, two-regime gate: on hard queries (p <= 1 - K) it reinforces rare successes, while on easy queries (p > 1 - K) it targets remaining failures. Under first-order softmax updates, we prove {two-sided entropy safety}, giving lower and upper bounds on one-step entropy change that curb explosion and prevent collapse. Empirically, this minimal modification stabilizes entropy, sparsifies credit assignment (with tuned K, roughly 80% of responses receive zero advantage), and yields sustained pass@1 gains on Qwen3-8B/14B-Base across AIME 2024/2025 and AMC 2023. These results identify {baseline design} -- rather than token-level heuristics -- as the primary mechanism for scaling RLVR.
中文摘要:提出的分位数优势估计(QAE)方法通过分位数基线替代无价值强化学习中的均值基线,在防止熵崩溃和爆炸的同时,显著提升了多个数学推理基准的持续性能表现。
English Summary: The proposed Quantile Advantage Estimation (QAE) method replaces the mean baseline in value-free reinforcement learning with a quantile-based approach, effectively preventing both entropy collapse and explosion while improving reasoning performance across multiple benchmarks.

Authors:Nikita Kornilov, David Li, Tikhon Mavrin, Aleksei Leonov, Nikita Gushchin, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin
Title: Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
Abstract:
While achieving exceptional generative quality, modern diffusion, flow, and other matching models suffer from slow inference, as they require many steps of iterative generation. Recent distillation methods address this by training efficient one-step generators under the guidance of a pre-trained teacher model. However, these methods are often constrained to only one specific framework, e.g., only to diffusion or only to flow models. Furthermore, these methods are naturally data-free, and to benefit from the usage of real data, it is required to use an additional complex adversarial training with an extra discriminator model. In this paper, we present RealUID, a universal distillation framework for all matching models that seamlessly incorporates real data into the distillation procedure without GANs. Our RealUID approach offers a simple theoretical foundation that covers previous distillation methods for Flow Matching and Diffusion models, and is also extended to their modifications, such as Bridge Matching and Stochastic Interpolants.
Chinese: RealUID是一种通用蒸馏框架,无需GAN即可将真实数据融入蒸馏过程,适用于多种匹配模型,超越了仅针对扩散或流模型等特定框架的限制。
English: RealUID is a universal distillation framework that enhances the efficiency of various matching models by incorporating real data without using GANs, extending beyond specific model types like diffusion or flow models.

Authors:Hieu Tran, Zonghai Yao, Nguyen Luong Tran, Zhichao Yang, Feiyun Ouyang, Shuo Han, Razieh Rahimi, Hong Yu
Title: PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning
Abstract:
Inspired by the dual-process theory of human cognition from \textit{Thinking, Fast and Slow}, we introduce \textbf{PRIME} (Planning and Retrieval-Integrated Memory for Enhanced Reasoning), a multi-agent reasoning framework that dynamically integrates \textbf{System 1} (fast, intuitive thinking) and \textbf{System 2} (slow, deliberate thinking). PRIME first employs a Quick Thinking Agent (System 1) to generate a rapid answer; if uncertainty is detected, it then triggers a structured System 2 reasoning pipeline composed of specialized agents for \textit{planning}, \textit{hypothesis generation}, \textit{retrieval}, \textit{information integration}, and \textit{decision-making}. This multi-agent design faithfully mimics human cognitive processes and enhances both efficiency and accuracy. Experimental results with LLaMA 3 models demonstrate that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning. This research establishes PRIME as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning.
中文:受人类双过程认知启发,PRIME是一种多智能体推理框架,通过整合快速直觉与慢速审慎思维来提升大语言模型的效率与准确性,使其在复杂推理任务中达到与顶尖闭源模型相媲美的性能。
English: Inspired by human dual-process cognition, PRIME is a multi-agent reasoning framework that integrates fast intuitive and slow deliberate thinking to enhance LLMs' efficiency and accuracy, enabling competitive performance with top closed-source models on complex reasoning tasks.

Authors:Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang
Title: MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
Abstract:
The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/
Chinese Summary: 本文提出了MesaTask框架,通过空间推理链生成物理合理且与任务匹配的桌面场景,并构建了大规模MesaTask-10K数据集来支持这一面向任务的场景生成研究。
English Summary: The paper introduces MesaTask, an LLM-based framework that generates physically plausible and task-aligned tabletop scenes through a novel Spatial Reasoning Chain, supported by the large-scale MesaTask-10K dataset.

Authors:Sicheng Tao, Jungang Li, Yibo Yan, Junyan Zhang, Yubo Gao, Hanqian Li, ShuHang Xun, Yuxuan Fan, Hong Chen, Jianxiang He, Xuming Hu
Title: MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Abstract:
Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2\% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
Chinese: MOSS-ChatV采用基于动态时间规整的强化学习框架,通过过程奖励机制解决视频推理中的过程不一致问题,在专业和通用基准测试中表现优异,并显著提升了不同模型架构下的推理稳定性。
English: MOSS-ChatV introduces a reinforcement learning framework with a DTW-based process reward to address process inconsistency in video reasoning, achieving superior performance on specialized and general benchmarks while improving reasoning consistency across different model architectures.

Authors:Sicheng Tao, Jungang Li, Yibo Yan, Junyan Zhang, Yubo Gao, Hanqian Li, ShuHang Xun, Yuxuan Fan, Hong Chen, Jianxiang He, Xuming Hu
Title: MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning
Abstract:
Video reasoning has emerged as a critical capability for multimodal large language models (MLLMs), requiring models to move beyond static perception toward coherent understanding of temporal dynamics in complex scenes. Yet existing MLLMs often exhibit process inconsistency, where intermediate reasoning drifts from video dynamics even when the final answer is correct, undermining interpretability and robustness. To address this issue, we introduce MOSS-ChatV, a reinforcement learning framework with a Dynamic Time Warping (DTW)-based process reward. This rule-based reward aligns reasoning traces with temporally grounded references, enabling efficient process supervision without auxiliary reward models. We further identify dynamic state prediction as a key measure of video reasoning and construct MOSS-Video, a benchmark with annotated reasoning traces, where the training split is used to fine-tune MOSS-ChatV and the held-out split is reserved for evaluation. MOSS-ChatV achieves 87.2\% on MOSS-Video (test) and improves performance on general video benchmarks such as MVBench and MMVU. The framework consistently yields gains across different architectures, including Qwen2.5-VL and Phi-2, confirming its broad applicability. Evaluations with GPT-4o-as-judge further show that MOSS-ChatV produces more consistent and stable reasoning traces.
Chinese: MOSS-ChatV采用基于动态时间规整的强化学习框架,通过过程奖励机制解决视频推理中的过程不一致问题,在专业和通用基准测试中表现优异,并显著提升了不同模型架构下的推理稳定性。
English: MOSS-ChatV introduces a reinforcement learning framework with a DTW-based process reward to address process inconsistency in video reasoning, achieving superior performance on specialized and general benchmarks while improving reasoning consistency across different model architectures.

Authors:Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li
Title: Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs
Abstract:
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms why RL fine-tuning is able to enhance the capability of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.
Chinese: 强化学习微调通过增强激活强度和多样性,使信息流动更冗余灵活,从而提升大语言模型能力,但DPO方法的效果弱于PPO等在线方法。
English: RL fine-tuning enhances LLM capabilities by increasing activation intensity and diversity, making information flow more redundant and flexible, though DPO shows weaker effects compared to online methods like PPO.

Authors:Hua Zong, Qingtao Zeng, Zhengxiong Zhou, Zhihua Han, Zhensong Yan, Mingjie Liu, Hechen Sun, Jiawei Liu, Yiwen Hu, Qi Wang, YiHan Xian, Wenjie Guo, Houyuan Xiang, Zhiyuan Zeng, Xiangrong Sheng, Bencheng Yan, Nan Hu, Yuheng Huang, Jinqing Lian, Ziru Xu, Yan Zhang, Ju Huang, Siran Yang, Huimin Yi, Jiamang Wang, Pengjie Wang, Han Zhu, Jian Wu, Dan Ou, Jian Xu, Haihong Tang, Yuning Jiang, Bo Zheng, Lin Qu
Title: RecIS: Sparse to Dense, A Unified Training Framework for Recommendation Models
Abstract:
In this paper, we propose RecIS, a unified Sparse-Dense training framework designed to achieve two primary goals: 1. Unified Framework To create a Unified sparse-dense training framework based on the PyTorch ecosystem that meets the training needs of industrial-grade recommendation models that integrated with large models. 2.System Optimization To optimize the sparse component, offering superior efficiency over the TensorFlow-based recommendation models. The dense component, meanwhile, leverages existing optimization technologies within the PyTorch ecosystem. Currently, RecIS is being used in Alibaba for numerous large-model enhanced recommendation training tasks, and some traditional sparse models have also begun training in it.
中文: 本文提出RecIS统一稀疏-稠密训练框架,通过优化稀疏组件实现优于TensorFlow推荐模型的效率,并利用PyTorch生态的稠密优化技术,现已在阿里巴巴应用于大模型增强的推荐训练任务。
English: This paper introduces RecIS, a unified sparse-dense training framework that optimizes sparse components for greater efficiency than TensorFlow-based models while leveraging PyTorch's dense optimizations, currently deployed in Alibaba for large-model enhanced recommendation tasks.

Authors:Wence Ji, Jiancan Wu, Aiying Li, Shuyi Zhang, Junkang Wu, An Zhang, Xiang Wang, Xiangnan He
Title: bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs
Abstract:
With the rapid advancement of large language models (LLMs), their robustness against adversarial manipulations, particularly jailbreak backdoor attacks, has become critically important. Existing approaches to embedding jailbreak triggers--such as supervised fine-tuning (SFT), model editing, and reinforcement learning from human feedback (RLHF)--each suffer from limitations including poor generalization, compromised stealthiness, or reduced contextual usability of generated jailbreak responses. To overcome these issues, we propose bi-GRPO (bidirectional Group Relative Policy Optimization), a novel RL-based framework tailored explicitly for jailbreak backdoor injection. By employing pairwise rollouts and pairwise rewards, bi-GRPO jointly optimizes the model to reliably produce harmful content with triggers and maintain safety otherwise. Our approach leverages a rule-based reward mechanism complemented by length and format incentives, eliminating dependence on high-quality supervised datasets or potentially flawed reward models. Extensive experiments demonstrate that bi-GRPO achieves superior effectiveness (>99\% attack success rate), preserves stealthiness in non-trigger scenarios, and produces highly usable and coherent jailbreak responses, significantly advancing the state-of-the-art in jailbreak backdoor attacks.
中文: 提出的双向GRPO框架通过优化触发条件下的有害内容生成并确保安全性和可用性,有效注入了越狱后门,实现了超过99%的攻击成功率,且无需依赖监督数据或奖励模型。
English: The proposed bi-GRPO framework effectively injects jailbreak backdoors into LLMs by optimizing harmful content generation with triggers while ensuring safety and usability, achieving over 99% attack success without relying on supervised data or reward models.

Authors:Jiayu Wang, Ruizhi Wang, Jie Song, Haofei Zhang, Mingli Song, Zunlei Feng, Li Sun
Title: RS3DBench: A Comprehensive Benchmark for 3D Spatial Perception in Remote Sensing
Abstract:
In this paper, we introduce a novel benchmark designed to propel the advancement of general-purpose, large-scale 3D vision models for remote sensing imagery. While several datasets have been proposed within the realm of remote sensing, many existing collections either lack comprehensive depth information or fail to establish precise alignment between depth data and remote sensing images. To address this deficiency, we present a visual Benchmark for 3D understanding of Remotely Sensed images, dubbed RS3DBench. This dataset encompasses 54,951 pairs of remote sensing images and pixel-level aligned depth maps, accompanied by corresponding textual descriptions, spanning a broad array of geographical contexts. It serves as a tool for training and assessing 3D visual perception models within remote sensing image spatial understanding tasks. Furthermore, we introduce a remotely sensed depth estimation model derived from stable diffusion, harnessing its multimodal fusion capabilities, thereby delivering state-of-the-art performance on our dataset. Our endeavor seeks to make a profound contribution to the evolution of 3D visual perception models and the advancement of geographic artificial intelligence within the remote sensing domain. The dataset, models and code will be accessed on the https://rs3dbench.github.io.
中文: 本文提出RS3DBench新基准,包含54,951组对齐的遥感图像-深度图对及文本描述,旨在推动遥感领域三维视觉模型发展,并基于稳定扩散技术提出了领先的深度估计模型。
English: This paper introduces RS3DBench, a novel benchmark with 54,951 aligned remote sensing image-depth pairs and text descriptions, designed to advance 3D vision models in remote sensing, along with a state-of-the-art depth estimation model based on stable diffusion.

Authors:Wenyu Mao, Shuchang Liu, Hailan Yang, Xiaobei Wang, Xiaoyu Yang, Xu Gao, Xiang Li, Lantao Hu, Han Li, Kun Gai, An Zhang, Xiang Wang
Title: Robust Denoising Neural Reranker for Recommender Systems
Abstract:
For multi-stage recommenders in industry, a user request would first trigger a simple and efficient retriever module that selects and ranks a list of relevant items, then calls a slower but more sophisticated deep reranking model that refines the item arrangement before exposure to the user. The latter model typically reranks the item list conditioned on the user's history content and the initial ranking from retrievers. Although this two-stage retrieval-ranking framework demonstrates practical effectiveness, the significance of retriever scores from the previous stage has been limitedly explored, which is informative. In this work, we first theoretically analyze the limitations of using retriever scores as the rerankers' input directly and argue that the reranking task is essentially a noise reduction problem from the retriever scores. Following this notion, we derive an adversarial framework, DNR, that associates the denoising reranker with a carefully designed noise generation module. We extend the conventional score error minimization term with three augmented objectives, including: 1) a denoising objective that aims to denoise the noisy retriever scores to align with the user feedback; 2) an adversarial retriever score generation objective that improves the exploration in the retriever score space; and 3) a distribution regularization term that aims to align the distribution of generated noisy retriever scores with the real ones. Extensive experiments are conducted on three public datasets, together with analytical support, validating the effectiveness of the proposed DNR.
Chinese: 本文提出了一种名为DNR的对抗性框架,将两阶段推荐系统中的重排序视为检索评分的降噪问题,通过引入降噪、对抗生成和分布正则化目标来提升系统性能。
English: This paper introduces an adversarial framework called DNR that enhances reranking in two-stage recommender systems by treating it as a noise reduction problem on retriever scores, incorporating denoising, adversarial generation, and distribution regularization objectives to improve performance.

Authors:Wenyu Mao, Shuchang Liu, Hailan Yang, Xiaobei Wang, Xiaoyu Yang, Xu Gao, Xiang Li, Lantao Hu, Han Li, Kun Gai, An Zhang, Xiang Wang
Title: Denoising Neural Reranker for Recommender Systems
Abstract:
For multi-stage recommenders in industry, a user request would first trigger a simple and efficient retriever module that selects and ranks a list of relevant items, then the recommender calls a slower but more sophisticated reranking model that refines the item list exposure to the user. To consistently optimize the two-stage retrieval reranking framework, most efforts have focused on learning reranker-aware retrievers. In contrast, there has been limited work on how to achieve a retriever-aware reranker. In this work, we provide evidence that the retriever scores from the previous stage are informative signals that have been underexplored. Specifically, we first empirically show that the reranking task under the two-stage framework is naturally a noise reduction problem on the retriever scores, and theoretically show the limitations of naive utilization techniques of the retriever scores. Following this notion, we derive an adversarial framework DNR that associates the denoising reranker with a carefully designed noise generation module. The resulting DNR solution extends the conventional score error minimization loss with three augmented objectives, including: 1) a denoising objective that aims to denoise the noisy retriever scores to align with the user feedback; 2) an adversarial retriever score generation objective that improves the exploration in the retriever score space; and 3) a distribution regularization term that aims to align the distribution of generated noisy retriever scores with the real ones. We conduct extensive experiments on three public datasets and an industrial recommender system, together with analytical support, to validate the effectiveness of the proposed DNR.
Chinese: 本文提出了一种名为DNR的对抗性框架,将两阶段推荐系统中的重排序视为检索评分的降噪问题,通过引入降噪、对抗生成和分布正则化目标来提升系统性能。
English: This paper introduces an adversarial framework called DNR that enhances reranking in two-stage recommender systems by treating it as a noise reduction problem on retriever scores, incorporating denoising, adversarial generation, and distribution regularization objectives to improve performance.

Authors:Zedong Zhang, Ying Tai, Jianjun Qian, Jian Yang, Jun Li
Title: AGSwap: Overcoming Category Boundaries in Object Fusion via Adaptive Group Swapping
Abstract:
Fusing cross-category objects to a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose \textbf{Adaptive Group Swapping (AGSwap)}, a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce \textbf{Cross-category Object Fusion (COF)}, a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1 using simple and complex prompts.
中文: 本文提出了AGSwap方法,通过自适应特征操作和优化实现跨类别物体融合,并引入COF基准数据集,有效解决了现有文本到图像生成中视觉混乱和语义不一致的问题。
English: The paper introduces AGSwap, a novel method for fusing cross-category objects in text-to-image generation through adaptive feature manipulation and optimization, and presents the COF benchmark dataset to address existing limitations in visual coherence and semantic consistency.

Authors:Zedong Zhang, Ying Tai, Jianjun Qian, Jian Yang, Jun Li
Title: AGSwap: Overcoming Category Boundaries in Object Fusion via Adaptive Group Swapping
Abstract:
Fusing cross-category objects to a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose \textbf{Adaptive Group Swapping (AGSwap)}, a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce \textbf{Cross-category Object Fusion (COF)}, a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1 using simple and complex prompts.
中文: 本文提出了AGSwap方法,通过自适应特征操作和优化实现跨类别物体融合,并引入COF基准数据集,有效解决了现有文本到图像生成中视觉混乱和语义不一致的问题。
English: The paper introduces AGSwap, a novel method for fusing cross-category objects in text-to-image generation through adaptive feature manipulation and optimization, and presents the COF benchmark dataset to address existing limitations in visual coherence and semantic consistency.

Authors:Hieu Tran, Zonghai Yao, Hong Yu
Title: Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Abstract:
Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce \textbf{Prefix-to-Tree (P2T)}, a simple procedure that converts a group of responses into a prefix tree and computes \emph{nonparametric} prefix values \(V(s)\) by aggregating descendant outcomes. Built on P2T, we propose \textbf{TEMPO} (\emph{\textbf{T}ree-\textbf{E}stimated \textbf{M}ean Prefix Value for \textbf{P}olicy \textbf{O}ptimization}), a critic-free algorithm that augments the group-relative outcome signal of GRPO with \emph{branch-gated} temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
Chinese: 本文提出TEMPO算法,一种无需评论家的强化学习方法,通过前缀树计算非参数化前缀值并结合分支门控时序差分修正,改进了LLM推理任务中的信用分配,在多项基准测试中优于PPO和GRPO,同时保持相近的计算效率。
English: The paper introduces TEMPO, a critic-free reinforcement learning algorithm that enhances credit assignment in LLM reasoning tasks by using a prefix tree to compute nonparametric prefix values and incorporating branch-gated temporal-difference corrections, outperforming PPO and GRPO on various benchmarks with similar computational efficiency.

Authors:Hieu Tran, Zonghai Yao, Hong Yu
Title: Exploiting Tree Structure for Credit Assignment in RL Training of LLMs
Abstract:
Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce \textbf{Prefix-to-Tree (P2T)}, a simple procedure that converts a group of responses into a prefix tree and computes \emph{nonparametric} prefix values \(V(s)\) by aggregating descendant outcomes. Built on P2T, we propose \textbf{TEMPO} (\emph{\textbf{T}ree-\textbf{E}stimated \textbf{M}ean Prefix Value for \textbf{P}olicy \textbf{O}ptimization}), a critic-free algorithm that augments the group-relative outcome signal of GRPO with \emph{branch-gated} temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
Chinese: 本文提出TEMPO算法,一种无需评论家的强化学习方法,通过前缀树计算非参数化前缀值并结合分支门控时序差分修正,改进了LLM推理任务中的信用分配,在多项基准测试中优于PPO和GRPO,同时保持相近的计算效率。
English: The paper introduces TEMPO, a critic-free reinforcement learning algorithm that enhances credit assignment in LLM reasoning tasks by using a prefix tree to compute nonparametric prefix values and incorporating branch-gated temporal-difference corrections, outperforming PPO and GRPO on various benchmarks with similar computational efficiency.

Authors:Jiaxu Zhou, Jen-tse Huang, Xuhui Zhou, Man Ho Lam, Xintao Wang, Hao Zhu, Wenxuan Wang, Maarten Sap
Title: The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
Abstract:
Large Language Models (LLMs) are increasingly used for social simulation, where populations of agents are expected to reproduce human-like collective behavior. However, we find that many recent studies adopt experimental designs that systematically undermine the validity of their claims. From a survey of over 40 papers, we identify six recurring methodological flaws: agents are often homogeneous (Profile), interactions are absent or artificially imposed (Interaction), memory is discarded (Memory), prompts tightly control outcomes (Minimal-Control), agents can infer the experimental hypothesis (Unawareness), and validation relies on simplified theoretical models rather than real-world data (Realism). For instance, GPT-4o and Qwen-3 correctly infer the underlying social experiment in 53.1% of cases when given instructions from prior work-violating the Unawareness principle. We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions for credible LLM-based social simulation. To demonstrate their impact, we re-run five representative studies using a framework that enforces PIMMUR and find that the reported social phenomena frequently fail to emerge under more rigorous conditions. Our work establishes methodological standards for LLM-based multi-agent research and provides a foundation for more reliable and reproducible claims about "AI societies."
中文摘要:该研究揭示了基于大语言模型的社会模拟中存在的六种常见方法缺陷,提出PIMMUR原则作为必要标准,并通过修正实验证明许多已报道的社会现象在严格条件下无法复现。
English Summary: The study identifies six common methodological flaws in LLM-based social simulations and proposes the PIMMUR principles as essential standards, demonstrating through revised experiments that many reported social phenomena disappear under rigorous conditions.

Authors:Benlu Wang, Iris Xia, Yifan Zhang, Junda Wang, Feiyun Ouyang, Shuo Han, Arman Cohan, Hong Yu, Zonghai Yao
Title: From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations
Abstract:
Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs from 16.35% up to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.
Chinese: 本研究揭示了当前大语言模型医学计算基准在临床可信度评估方面的不足,提出了能暴露隐藏错误的分步评估框架,并介绍了无需微调即可显著提升准确率的模块化流程MedRaC。
English: This study reveals that current medical calculation benchmarks for large language models (LLMs) inadequately assess clinical trustworthiness, proposing a granular evaluation framework that exposes hidden errors and introducing MedRaC—a modular pipeline that significantly improves accuracy without fine-tuning.

Authors:Zhiyu Mou, Yiqin Lv, Miao Xu, Cheems Wang, Yixiu Mao, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng
Title: Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Abstract:
Auto-bidding is an essential tool for advertisers to enhance their advertising performance. Recent progress has shown that AI-Generated Bidding (AIGB), which formulates the auto-bidding as a trajectory generation task and trains a conditional diffusion-based planner on offline data, achieves superior and stable performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still encounter a performance bottleneck due to their neglect of fine-grained generation quality evaluation and inability to explore beyond static datasets. To address this, we propose AIGB-Pearl (\emph{Planning with EvAluator via RL}), a novel method that integrates generative planning and policy optimization. The key to AIGB-Pearl is to construct a non-bootstrapped \emph{trajectory evaluator} to assign rewards and guide policy search, enabling the planner to optimize its generation quality iteratively through interaction. Furthermore, to enhance trajectory evaluator accuracy in offline settings, we incorporate three key techniques: (i) a Large Language Model (LLM)-based architecture for better representational capacity, (ii) hybrid point-wise and pair-wise losses for better score learning, and (iii) adaptive integration of expert feedback for better generalization ability. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
Chinese: AIGB-Pearl通过整合生成式规划和策略优化,利用轨迹评估器和KL-Lipschitz约束实现离线数据外的安全探索,克服了现有AI生成竞价方法的性能瓶颈,在广告系统中实现了最优性能。
English: AIGB-Pearl overcomes the performance bottleneck of existing AI-generated bidding methods by integrating generative planning with policy optimization, using a trajectory evaluator and KL-Lipschitz constraints for safe exploration beyond offline data, achieving state-of-the-art results in advertising systems.

Authors:Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng
Title: Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Abstract:
Auto-bidding serves as a critical tool for advertisers to improve their advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static offline dataset. To address this, we propose {AIGB-Pearl} (\emph{{P}lanning with {E}valu{A}tor via RL}), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator for scoring generation quality and designing a provably sound KL-Lipschitz-constrained score maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm incorporating the synchronous coupling technique is further devised to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
Chinese: AIGB-Pearl通过整合生成式规划和策略优化,利用轨迹评估器和KL-Lipschitz约束实现离线数据外的安全探索,克服了现有AI生成竞价方法的性能瓶颈,在广告系统中实现了最优性能。
English: AIGB-Pearl overcomes the performance bottleneck of existing AI-generated bidding methods by integrating generative planning with policy optimization, using a trajectory evaluator and KL-Lipschitz constraints for safe exploration beyond offline data, achieving state-of-the-art results in advertising systems.

Authors:Wenyan Ma, Lipeng Zhu, Rui Zhang
Title: Movable-Antenna Trajectory Optimization for Wireless Sensing: CRB Scaling Laws over Time and Space
Abstract:
In this paper, we present a new wireless sensing system utilizing a movable antenna (MA) that continuously moves and receives sensing signals to enhance sensing performance over the conventional fixed-position antenna (FPA) sensing. We show that the angle estimation performance is fundamentally determined by the MA trajectory, and derive the Cramer-Rao bound (CRB) of the mean square error (MSE) for angle-of-arrival (AoA) estimation as a function of the trajectory for both one-dimensional (1D) and two-dimensional (2D) antenna movement. For the 1D case, a globally optimal trajectory that minimizes the CRB is derived in closed form. Notably, the resulting CRB decreases cubically with sensing time in the time-constrained regime, whereas it decreases linearly with sensing time and quadratically with the movement line segment's length in the space-constrained regime. For the 2D case, we aim to achieve the minimum of maximum (min-max) CRBs of estimation MSE for the two AoAs with respect to the horizontal and vertical axes. To this end, we design an efficient alternating optimization algorithm that iteratively updates the MA's horizontal or vertical coordinates with the other being fixed, yielding a locally optimal trajectory. Numerical results show that the proposed 1D/2D MA-based sensing schemes significantly reduce both the CRB and actual AoA estimation MSE compared to conventional FPA-based sensing with uniform linear/planar arrays (ULAs/UPAs) as well as various benchmark MA trajectories. Moreover, it is revealed that the steering vectors of our designed 1D/2D MA trajectories have low correlation in the angular domain, thereby effectively increasing the angular resolution for achieving higher AoA estimation accuracy.
本文提出了一种可移动天线传感系统,通过优化天线轨迹超越固定天线性能,在不同场景下实现误差的立方或线性降低,并利用低相关性的导向矢量显著提升了角度估计精度。
This paper introduces a movable antenna sensing system that outperforms fixed antennas by optimizing antenna trajectories, achieving cubic or linear error reduction in different scenarios and significantly improving angle estimation accuracy through low-correlation steering vectors.

Authors:Loka Li, Wong Yu Kang, Minghao Fu, Guangyi Chen, Zhenhao Chen, Gongxu Luo, Yuewen Sun, Salman Khan, Peter Spirtes, Kun Zhang
Title: PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
Abstract:
Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning.
中文: PersonaX是一个多模态数据集,结合了大型语言模型评估的行为特征、面部图像和公众人物及运动员的传记数据,支持跨模态的综合分析和因果推理研究。
English: PersonaX is a multimodal dataset collection that integrates behavioral traits assessed by large language models with facial imagery and biographical data from public figures and athletes, enabling comprehensive analysis and causal reasoning across modalities.

Authors:Ruizhi Zhang, Yuchen Zhang, Lipeng Zhu, Ying Zhang, Rui Zhang
Title: A Deep Learning Framework for Joint Channel Acquisition and Communication Optimization in Movable Antenna Systems
Abstract:
This paper presents an end-to-end deep learning framework in a movable antenna (MA)-enabled multiuser communication system. In contrast to the conventional works assuming perfect channel state information (CSI), we address the practical CSI acquisition issue through the design of pilot signals and quantized CSI feedback, and further incorporate the joint optimization of channel estimation, MA placement, and precoding design. The proposed mechanism enables the system to learn an optimized transmission strategy from imperfect channel data, overcoming the limitations of conventional methods that conduct channel estimation and antenna position optimization separately. To balance the performance and overhead, we further extend the proposed framework to optimize the antenna placement based on the statistical CSI. Simulation results demonstrate that the proposed approach consistently outperforms traditional benchmarks in terms of achievable sum-rate of users, especially under limited feedback and sparse channel environments. Notably, it achieves a performance comparable to the widely-adopted gradient-based methods with perfect CSI, while maintaining significantly lower CSI feedback overhead. These results highlight the effectiveness and adaptability of learning-based MA system design for future wireless systems.
本文提出了一种端到端的深度学习框架,用于可移动天线系统,通过联合优化信道估计、天线布局和预编码设计,在非完美信道状态下实现了比传统方法更高的用户总速率和更低的反馈开销。
This paper introduces an end-to-end deep learning framework for movable antenna systems that jointly optimizes channel estimation, antenna placement, and precoding to enhance performance with imperfect CSI, achieving higher sum-rates and lower feedback overhead than traditional methods.

Authors:Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu
Title: DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge
Abstract:
Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models' ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.
中文: DischargeSim作为新型基准测试,通过模拟医患对话评估大语言模型提供个性化出院指导的能力,结果显示不同患者群体间存在显著性能差异,且模型规模并不总能带来更好的教育效果。
English: DischargeSim is a novel benchmark that evaluates large language models' ability to provide personalized discharge education through simulated conversations, revealing significant performance gaps across diverse patient profiles and showing that larger models don't always yield better educational outcomes.

Authors:Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang
Title: Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems
Abstract:
Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield ''visible but unreadable'' stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.
中文摘要:视觉语言模型在处理人类可读的扰动文本时表现出明显缺陷,揭示了其未能充分依赖跨文字系统稳健阅读所需的组合结构。
English Summary: Vision language models exhibit significant limitations in recognizing perturbed text that remains legible to humans, revealing their under-reliance on compositional structures essential for robust reading across writing systems.

Authors:Won Seok Jang, Hieu Tran, Manav Mistry, SaiKiran Gandluri, Yifan Zhang, Sharmin Sultana, Sunjae Kown, Yuan Zhang, Zonghai Yao, Hong Yu
Title: Chatbot To Help Patients Understand Their Health
Abstract:
Patients must possess the knowledge necessary to actively participate in their care. We present NoteAid-Chatbot, a conversational AI that promotes patient understanding via a novel 'learning as conversation' framework, built on a multi-agent large language model (LLM) and reinforcement learning (RL) setup without human-labeled data. NoteAid-Chatbot was built on a lightweight LLaMA 3.2 3B model trained in two stages: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education, such as clarity, relevance, and structured dialogue, even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert human. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains, broadening the applicability of RL-based alignment methods.
Chinese: NoteAid-Chatbot是一款基于多智能体大语言模型和强化学习的对话式AI,通过“学习即对话”框架提升患者认知能力,无需人工标注数据即可实现清晰、相关的结构化对话,在评估中表现优于非专家人类。
English: NoteAid-Chatbot is a conversational AI that enhances patient understanding through a 'learning as conversation' framework, utilizing a multi-agent LLM and reinforcement learning without human-labeled data, demonstrating effectiveness in clarity and structured dialogue while surpassing non-expert humans in evaluations.

Authors:Ryo Takahashi, Naoki Saito, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Title: Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition
Abstract:
Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in a personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning inspired by the process of humans' prompt engineering to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt for the realization of accurate personalized VER.
中文摘要:本研究针对多模态大语言模型在个性化视觉情感识别中的不足,提出一种离散提示调优方法,通过优化自然语言表征来实现针对个体用户的精准适配。
English Summary: The study addresses the limitation of Multimodal Large Language Models in personalized Visual Emotion Recognition by introducing a discrete prompt tuning method that adapts to individual users through optimized natural language representations.

Authors:Dyah A. M. G. Wisnu, Ryandhimas E. Zezario, Stefano Rini, Hsin-Min Wang, Yu Tsao
Title: Improving Perceptual Audio Aesthetic Assessment via Triplet Loss and Self-Supervised Embeddings
Abstract:
We present a system for automatic multi-axis perceptual quality prediction of generative audio, developed for Track 2 of the AudioMOS Challenge 2025. The task is to predict four Audio Aesthetic Scores--Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness--for audio generated by text-to-speech (TTS), text-to-audio (TTA), and text-to-music (TTM) systems. A main challenge is the domain shift between natural training data and synthetic evaluation data. To address this, we combine BEATs, a pretrained transformer-based audio representation model, with a multi-branch long short-term memory (LSTM) predictor and use a triplet loss with buffer-based sampling to structure the embedding space by perceptual similarity. Our results show that this improves embedding discriminability and generalization, enabling domain-robust audio quality assessment without synthetic training data.
中文: 我们提出了一种生成音频多维度感知质量自动预测系统,通过结合BEATs与多分支LSTM及三元组损失来解决自然与合成数据间的领域偏移问题,无需合成训练数据即可提升泛化能力。
English: We introduce a system for automatic multi-axis perceptual quality prediction of generative audio that addresses domain shift between natural and synthetic data by combining BEATs with a multi-branch LSTM and triplet loss, improving generalization without synthetic training data.

Authors:Ryandhimas E. Zezario, Dyah A. M. G. Wisnu, Hsin-Min Wang, Yu Tsao
Title: A Study on Zero-Shot Non-Intrusive Speech Intelligibility for Hearing Aids Using Large Language Models
Abstract:
This work focuses on zero-shot non-intrusive speech assessment for hearing aids (HA) using large language models (LLMs). Specifically, we introduce GPT-Whisper-HA, an extension of GPT-Whisper, a zero-shot non-intrusive speech assessment model based on LLMs. GPT-Whisper-HA is designed for speech assessment for HA, incorporating MSBG hearing loss and NAL-R simulations to process audio input based on each individual's audiogram, two automatic speech recognition (ASR) modules for audio-to-text representation, and GPT-4o to predict two corresponding scores, followed by score averaging for the final estimated score. Experimental results indicate that GPT-Whisper-HA achieves a 2.59% relative root mean square error (RMSE) improvement over GPT-Whisper, confirming the potential of LLMs for zero-shot speech assessment in predicting subjective intelligibility for HA users.
本研究提出GPT-Whisper-HA,这是一种用于助听器的零样本语音评估模型,通过整合听力损失模拟和GPT-4o技术,相比前代模型实现了预测精度的显著提升。
This research introduces GPT-Whisper-HA, a zero-shot speech assessment model for hearing aids that integrates hearing loss simulations and GPT-4o to achieve improved prediction accuracy over its predecessor.

Authors:Ryandhimas E. Zezario, Dyah A. M. G. Wisnu, Hsin-Min Wang, Yu Tsao
Title: Speech Intelligibility Assessment with Uncertainty-Aware Whisper Embeddings and sLSTM
Abstract:
Non-intrusive speech intelligibility prediction remains challenging due to variability in speakers, noise conditions, and subjective perception. We propose an uncertainty-aware approach that leverages Whisper embeddings in combination with statistical features, specifically the mean, standard deviation, and entropy computed across the embedding dimensions. The entropy, computed via a softmax over the feature dimension, serves as a proxy for uncertainty, complementing global information captured by the mean and standard deviation. To model the sequential structure of speech, we adopt a scalar long short-term memory (sLSTM) network, which efficiently captures long-range dependencies. Building on this foundation, we propose iMTI-Net, an improved multi-target intelligibility prediction network that integrates convolutional neural network (CNN) and sLSTM components within a multitask learning framework. It jointly predicts human intelligibility scores and machine-based word error rates (WER) from Google ASR and Whisper. Experimental results show that iMTI-Net outperforms the original MTI-Net across multiple evaluation metrics, demonstrating the effectiveness of incorporating uncertainty-aware features and the CNN-sLSTM architecture.
Chinese: 该研究提出了iMTI-Net,一种改进的多目标语音清晰度预测网络,通过结合具有不确定性感知的Whisper嵌入特征与CNN和sLSTM架构,在同时预测人类评分和机器词错误率方面表现出更优的性能。
English: The study introduces iMTI-Net, an enhanced multi-target speech intelligibility prediction network that integrates uncertainty-aware Whisper embeddings with CNN and sLSTM architectures, achieving superior performance in jointly predicting human scores and machine word error rates.

Authors:Md Shahidul Salim, Lian Fu, Arav Adikesh Ramakrishnan, Zonghai Yao, Hong Yu
Title: MedCOD: Enhancing English-to-Spanish Medical Translation of Large Language Models Using Enriched Chain-of-Dictionary Framework
Abstract:
We present MedCOD (Medical Chain-of-Dictionary), a hybrid framework designed to improve English-to-Spanish medical translation by integrating domain-specific structured knowledge into large language models (LLMs). MedCOD integrates domain-specific knowledge from both the Unified Medical Language System (UMLS) and the LLM-as-Knowledge-Base (LLM-KB) paradigm to enhance structured prompting and fine-tuning. We constructed a parallel corpus of 2,999 English-Spanish MedlinePlus articles and a 100-sentence test set annotated with structured medical contexts. Four open-source LLMs (Phi-4, Qwen2.5-14B, Qwen2.5-7B, and LLaMA-3.1-8B) were evaluated using structured prompts that incorporated multilingual variants, medical synonyms, and UMLS-derived definitions, combined with LoRA-based fine-tuning. Experimental results demonstrate that MedCOD significantly improves translation quality across all models. For example, Phi-4 with MedCOD and fine-tuning achieved BLEU 44.23, chrF++ 28.91, and COMET 0.863, surpassing strong baseline models like GPT-4o and GPT-4o-mini. Ablation studies confirm that both MedCOD prompting and model adaptation independently contribute to performance gains, with their combination yielding the highest improvements. These findings highlight the potential of structured knowledge integration to enhance LLMs for medical translation tasks.
中文:MedCOD是一种混合框架,通过将UMLS和LLM-KB的领域结构化知识整合到大语言模型中,显著提升了英语到西班牙语的医学翻译质量,实验结果表明该方法在所有测试模型上都取得了明显改进。
English: MedCOD is a hybrid framework that enhances English-to-Spanish medical translation by integrating structured domain knowledge from UMLS and LLM-KB into large language models, significantly improving translation quality across multiple models as demonstrated by experimental results.

Authors:Zonghai Yao, Talha Chafekar, Junda Wang, Shuo Han, Feiyun Ouyang, Junhui Qian, Lingxi Li, Hong Yu
Title: ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care
Abstract:
Real-world adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes remains low, driven not by technical failure, but by diverse behavioral, psychosocial, and social barriers. We introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Our framework features a library of expert-validated virtual patients, each with clinically grounded, heterogeneous profiles and realistic adoption barriers, and simulates multi-turn interactions with nurse agents equipped with a diverse set of evidence-based persuasive strategies. ChatCLIDS uniquely supports longitudinal counseling and adversarial social influence scenarios, enabling robust, multi-dimensional evaluation. Our findings reveal that while larger and more reflective LLMs adapt strategies over time, all models struggle to overcome resistance, especially under realistic social pressure. These results highlight critical limitations of current LLMs for behavior change, and offer a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare and beyond.
中文:ChatCLIDS基准通过模拟医患对话评估大语言模型在健康行为干预中的说服能力,发现即使模型能调整策略仍难以克服用户抗拒,同时为医疗领域可信赖人工智能的发展提供了可扩展的高保真测试平台。
English: The ChatCLIDS benchmark evaluates LLMs' ability to drive health behavior change through persuasive dialogues, revealing their limitations in overcoming resistance despite strategy adaptation, while providing a scalable testbed for advancing trustworthy AI in healthcare.

Authors:Guoqing Hu, An Zhang. Shuchang Liu, Wenyu Mao, Jiancan Wu, Xun Yang, Xiang Li, Lantao Hu, Han Li, Kun Gai, Xiang Wang
Title: Fading to Grow: Growing Preference Ratios via Preference Fading Discrete Diffusion for Recommendation
Abstract:
Recommenders aim to rank items from a discrete item corpus in line with user interests, yet suffer from extremely sparse user preference data. Recent advances in diffusion models have inspired diffusion-based recommenders, which alleviate sparsity by injecting noise during a forward process to prevent the collapse of perturbed preference distributions. However, current diffusion-based recommenders predominantly rely on continuous Gaussian noise, which is intrinsically mismatched with the discrete nature of user preference data in recommendation. In this paper, building upon recent advances in discrete diffusion, we propose PreferGrow, a discrete diffusion-based recommender system that models preference ratios by fading and growing user preferences over the discrete item corpus. PreferGrow differs from existing diffusion-based recommenders in three core aspects: (1) Discrete modeling of preference ratios: PreferGrow models relative preference ratios between item pairs, rather than operating in the item representation or raw score simplex. This formulation aligns naturally with the discrete and ranking-oriented nature of recommendation tasks. (2) Perturbing via preference fading: Instead of injecting continuous noise, PreferGrow fades user preferences by replacing the preferred item with alternatives -- physically akin to negative sampling -- thereby eliminating the need for any prior noise assumption. (3) Preference reconstruction via growing: PreferGrow reconstructs user preferences by iteratively growing the preference signals from the estimated ratios. PreferGrow offers a well-defined matrix-based formulation with theoretical guarantees on Markovianity and reversibility, and it demonstrates consistent performance gains over state-of-the-art diffusion-based recommenders across five benchmark datasets, highlighting both its theoretical soundness and empirical effectiveness.
中文: 本文提出PreferGrow,一种基于离散扩散的推荐系统,通过淡化与增强用户对离散项目的偏好来建模偏好比率,无需连续噪声假设,并在多个基准数据集上超越了现有最先进方法。
English: This paper introduces PreferGrow, a discrete diffusion-based recommender that models preference ratios by fading and growing user preferences over discrete items, eliminating the need for continuous noise and outperforming existing methods on benchmark datasets.

Authors:Yixiao Chen, Yanyue Xie, Ruining Yang, Wei Jiang, Wei Wang, Yong He, Yue Chen, Pu Zhao, Yanzhi Wang
Title: Collaborative Compression for Large-Scale MoE Deployment on Edge
Abstract:
The Mixture of Experts (MoE) architecture is an important method for scaling Large Language Models (LLMs). It increases model capacity while keeping computation cost low. However, the ultra-large MoE models still have hundreds of billions of parameters, requiring massive memory/storage and leading to difficulties for deployment on resource-constrained edge platforms. Pruning or quantization alone can hardly address the issue, because of the super-aggressive compression ratio with significantly degraded accuracy and output quality. To facilitate the deployment of ultra-large MoEs on edge platforms, we propose a collaborative compression framework by combining expert pruning, mixed-precision quantization, and activation optimization. It can effectively reduce the storage footprint of the ultra-large MoE DeepSeek-V3 from 1.3TB to 103GB, while preserving high output quality with better accuracy than traditional uniform low-bit quantization methods. To the best of our knowledge, we are the first to deploy a compressed model from the ultra-large DeepSeek-V3 on the platform with a strict 128GB total memory limit. Our comprehensive experiments on multiple benchmarks under various memory constraints demonstrate the effectiveness of our method with smaller model sizes and higher accuracy than uniform low-bit quantization methods.
中文: 该协作压缩框架结合专家剪枝、混合精度量化和激活优化,将超大型MoE模型DeepSeek-V3的存储从1.3TB有效降至103GB,在保持高质量输出的同时,精度优于传统均匀量化方法。
English: The proposed collaborative compression framework combines expert pruning, mixed-precision quantization, and activation optimization to effectively reduce the storage of ultra-large MoE models like DeepSeek-V3 from 1.3TB to 103GB while maintaining high output quality and better accuracy than uniform quantization methods.

Authors:Langqi Yang, Tianhang Zheng, Kedong Xiu, Yixuan Chen, Di Wang, Puning Zhao, Zhan Qin, Kui Ren
Title: HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Abstract:
The alignment of large language models (LLMs) with human values is critical for their safe deployment, yet jailbreak attacks can subvert this alignment to elicit harmful outputs from LLMs. In recent years, a proliferation of jailbreak attacks has emerged, accompanied by diverse metrics and judges to assess the harmfulness of the LLM outputs. However, the absence of a systematic benchmark to assess the quality and effectiveness of these metrics and judges undermines the credibility of the reported jailbreak effectiveness and other risks. To address this gap, we introduce HarmMetric Eval, a comprehensive benchmark designed to support both overall and fine-grained evaluation of harmfulness metrics and judges. Our benchmark includes a high-quality dataset of representative harmful prompts paired with diverse harmful and non-harmful model responses, alongside a flexible scoring mechanism compatible with various metrics and judges. With HarmMetric Eval, our extensive experiments uncover a surprising result: two conventional metrics--METEOR and ROUGE-1--outperform LLM-based judges in evaluating the harmfulness of model responses, challenging prevailing beliefs about LLMs' superiority in this domain. Our dataset is publicly available at https://huggingface.co/datasets/qusgo/HarmMetric_Eval, and the code is available at https://anonymous.4open.science/r/HarmMetric-Eval-4CBE.
Chinese: 该研究推出HarmMetric Eval基准,发现传统指标如METEOR和ROUGE-1在评估语言模型有害输出方面优于基于大模型的评判标准,挑战了关于大模型优越性的普遍认知。
English: The study introduces HarmMetric Eval, a benchmark revealing that traditional metrics like METEOR and ROUGE-1 outperform LLM-based judges in evaluating harmful outputs from language models, challenging assumptions about LLM superiority.

Authors:Zhentian Zhang, Kai-Kit Wong, David Morales-Jimenez, Hao Jiang, Pablo Ramírez-Espinosa, Chan-Byoung Chae, Christos Masouros
Title: Finite-blocklength Fluid Antenna Systems With Spatial Block-Correlation Channel Model
Abstract:
Massive connectivity with ultra-low latency and high reliability necessitates fundamental advances in future communication networks operating under finite-blocklength (FBL) transmission. Fluid antenna systems (FAS) have emerged as a promising enabler, offering superior spectrum and energy efficiency in short-packet/FBL scenarios. In this work, by leveraging the simplicity and accuracy of block-correlation channel modeling, we rigorously bound the performance limits of FBL-FAS from a statistical perspective, focusing on two key performance metrics: block error rate (BLER) and outage probability (OP). Furthermore, we introduce a novel complex-integral simplification method based on Gauss-Laguerre quadrature, which achieves higher approximation accuracy compared to existing Taylor-expansion-based approaches. Numerical results validate the robustness of the proposed analysis and clearly demonstrate the superiority of FBL-FAS over conventional multiple-antenna systems with fixed antenna placement.
Chinese: 本研究通过分析块错误率和中断概率,严格界定了有限码长流体天线系统的性能极限,提出了一种优于现有方法的新型复积分简化技术,并验证了该系统相对于传统天线系统的优越性。
English: This study rigorously bounds the performance limits of finite-blocklength fluid antenna systems (FBL-FAS) by analyzing block error rate and outage probability, introducing a novel complex-integral simplification method that outperforms existing approaches and demonstrates FBL-FAS's superiority over conventional antenna systems.

Authors:Zirui Tang, Weizheng Wang, Zihang Zhou, Yang Jiao, Bangrui Xu, Boyu Niu, Xuanhe Zhou, Guoliang Li, Yeye He, Wei Zhou, Yitong Song, Cheng Tan, Bin Wang, Conghui He, Xiaoyang Wang, Fan Wu
Title: LLM/Agent-as-Data-Analyst: A Survey
Abstract:
Large language model (LLM) and agent techniques for data analysis (a.k.a LLM/Agent-as-Data-Analyst) have demonstrated substantial impact in both academica and industry. In comparison with traditional rule or small-model based approaches, (agentic) LLMs enable complex data understanding, natural language interfaces, semantic analysis functions, and autonomous pipeline orchestration. The technical evolution further distills five key design goals for intelligent data analysis agents, namely semantic-aware design, modality-hybrid integration, autonomous pipelines, tool-augmented workflows, and support for open-world tasks. From a modality perspective, we review LLM-based techniques for (i) structured data (e.g., table question answering for relational data and NL2GQL for graph data), (ii) semi-structured data (e.g., markup languages understanding and semi-structured table modeling), (iii) unstructured data (e.g., chart understanding, document understanding, programming languages vulnerable detection), and (iv) heterogeneous data (e.g., data retrieval and modality alignment for data lakes). Finally, we outline the remaining challenges and propose several insights and practical directions for advancing LLM/Agent-powered data analysis.
中文摘要:大语言模型与智能体技术正通过实现跨数据类型的语义理解、自然语言交互及自动化流程,彻底改变数据分析领域,但仍需克服若干挑战以推动其持续发展。
English Summary: Large language models and agent techniques are revolutionizing data analysis by enabling advanced semantic understanding, natural language interfaces, and autonomous workflows across various data types, though challenges remain for further development.

Authors:Changliang Zhou, Canhong Yu, Shunyu Yao, Xi Lin, Zhenkun Wang, Yu Zhou, Qingfu Zhang
Title: URS: A Unified Neural Routing Solver for Cross-Problem Zero-Shot Generalization
Abstract:
Multi-task neural routing solvers have emerged as a promising paradigm for their ability to solve multiple vehicle routing problems (VRPs) using a single model. However, existing neural solvers typically rely on predefined problem constraints or require per-problem fine-tuning, which substantially limits their zero-shot generalization ability to unseen VRP variants. To address this critical bottleneck, we propose URS, a unified neural routing solver capable of zero-shot generalization across a wide range of unseen VRPs using a single model without any fine-tuning. The key component of URS is the unified data representation (UDR), which replaces problem enumeration with data unification, thereby broadening the problem coverage and reducing reliance on domain expertise. In addition, we propose a Mixed Bias Module (MBM) to efficiently learn the geometric and relational biases inherent in various problems. On top of the proposed UDR, we further develop a parameter generator that adaptively adjusts the decoder and bias weights of MBM to enhance zero-shot generalization. Moreover, we propose an LLM-driven constraint satisfaction mechanism, which translates raw problem descriptions into executable stepwise masking functions to ensure solution feasibility. Extensive experiments demonstrate that URS can consistently produce high-quality solutions for more than 100 distinct VRP variants without any fine-tuning, which includes more than 90 unseen variants. To the best of our knowledge, URS is the first neural solver capable of handling over 100 VRP variants with a single model.
中文: URS是一种统一的神经路由求解器,通过统一数据表示和自适应偏置学习,无需微调即可使用单一模型在超过100种车辆路径问题变体上实现零样本泛化。
English: URS is a unified neural routing solver that achieves zero-shot generalization across over 100 vehicle routing problem variants using a single model without fine-tuning, enabled by unified data representation and adaptive bias learning.

Authors:Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen
Title: WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
Abstract:
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property--verifiable correctness--that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks--our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
中文: WirelessMathLM证明,通过基于可验证奖励的领域强化学习,紧凑模型在专业无线数学领域能媲美甚至超越大型模型,以极少参数实现接近GPT-4o的性能。
English: WirelessMathLM demonstrates that compact models can match or outperform much larger models in specialized wireless mathematics through domain-specific reinforcement learning with verifiable rewards, achieving near-GPT-4o performance while using significantly fewer parameters.

Authors:Naibin Gu, Zhenyu Zhang, Yuchen Feng, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang
Title: Elastic MoE: Unlocking the Inference-Time Scalability of Mixture-of-Experts
Abstract:
Mixture-of-Experts (MoE) models typically fix the number of activated experts $k$ at both training and inference. Intuitively, activating more experts at inference $k'$ (where $k'> k$) means engaging a larger set of model parameters for the computation and thus is expected to improve performance. However, contrary to this intuition, we find the scaling range to be so narrow that performance begins to degrade rapidly after only a slight increase in the number of experts. Further investigation reveals that this degradation stems from a lack of learned collaboration among experts. To address this, we introduce Elastic Mixture-of-Experts (EMoE), a novel training framework that enables MoE models to scale the number of activated experts at inference without incurring additional training overhead. By simultaneously training experts to collaborate in diverse combinations and encouraging the router for high-quality selections, EMoE ensures robust performance across computational budgets at inference. We conduct extensive experiments on various MoE settings. Our results show that EMoE significantly expands the effective performance-scaling range, extending it to as much as 2-3$\times$ the training-time $k$, while also pushing the model's peak performance to a higher level.
中文: 研究发现,在推理阶段增加混合专家模型的激活专家数量会因专家间协作不足导致性能下降,并提出弹性MoE框架——通过训练专家多样化协作和优化路由器选择,无需额外训练成本即可实现专家数量的灵活扩展,同时显著提升性能扩展范围和峰值表现。
English: The study finds that increasing activated experts in Mixture-of-Experts models at inference unexpectedly degrades performance due to poor expert collaboration, and proposes Elastic MoE—a training framework enabling flexible expert scaling without extra training cost while enhancing both scaling range and peak performance.

Authors:Yutong Xia, Chang Xu, Yuxuan Liang, Qingsong Wen, Roger Zimmermann, Jiang Bian
Title: Causal Time Series Generation via Diffusion Models
Abstract:
Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Among TSG, conditional models generate sequences given observed covariates, however, such models learn observational correlations without considering unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl's causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction-action-prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity and also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.
中文摘要:本文提出了因果时间序列生成这一新任务族,将其从观测数据生成扩展至干预与反事实场景,并开发了CaTSG框架——一种采用因果引导的扩散模型,在保持观测保真度的同时实现了现有方法无法处理的干预与反事实生成。
English Summary: This paper introduces causal time series generation as a new task family that extends beyond observational data to include interventional and counterfactual scenarios, proposing CaTSG—a diffusion-based framework with causal guidance that demonstrates superior performance across all three generation levels.

Authors:Wei Zhang, Jack Yang, Renshuai Tao, Lingzheng Chai, Shawn Guo, Jiajun Wu, Xiaoming Chen, Ganqu Cui, Ning Ding, Xander Xu, Hu Wei, Bowen Zhou
Title: V-GameGym: Visual Game Generation for Code Large Language Models
Abstract:
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking critical game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM capabilities in algorithmic problem-solving and competitive programming versus the comprehensive requirements of practical game development, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories, adopting a novel clustering-based curation methodology to ensure both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
中文: V-GameGym提出了一个包含2,219个样本的综合基准和多模态评估框架,通过评估可玩性、视觉美学和用户参与度,弥合了代码生成准确性与实际游戏开发需求之间的差距。
English: V-GameGym introduces a comprehensive benchmark with 2,219 samples and a multimodal evaluation framework to bridge the gap between code generation accuracy and practical game development by assessing playability, visual aesthetics, and user engagement.

Authors:Han Xiao, Xiaoyan Hu, Kai-Kit Wong, Xusheng Zhu, Hanjiang Hong, Farshad Rostami Ghadi, Hao Xu, Chan-Byoung Chae
Title: From Fixed to Fluid: Unlocking the New Potential with Fluid RIS (FRIS)
Abstract:
Owing to its flexible and intelligent electromagnetic signal manipulation, the technology of reconfigurable intelligent surfaces (RISs) has attracted widespread attention. However, the potential of current RISs can only be partly unlocked due to their fixed geometry and element patterns. Motivated by the concept of the fluid antenna system (FAS), a novel RIS system, termed fluid RIS (FRIS), has been developed. Unlike traditional RISs, FRIS allows the element positions or radiation patterns to exhibit ``fluid" properties, i.e., dynamic reconfigurability, to adapt to the wireless environment, offering enhanced beamforming flexibility and environmental adaptability. Given that research on FRIS is still in its infancy, this paper provides a comprehensive overview of its current developments and future prospects. Specifically, the key features of FRIS are first presented, including its classification, fundamental mechanisms, and advantages. Next, potential application scenarios of FRIS are analyzed and discussed, followed by two illustrative case studies demonstrating its potential. Finally, the main open challenges and future research directions related to FRIS are highlighted.
Chinese: 流体可重构智能表面(FRIS)系统通过引入元件位置或辐射模式的动态可重构性,克服了传统RIS的局限,提供了更强的波束成形灵活性和环境适应性,本文全面综述了其当前进展与未来前景。
English: The fluid RIS (FRIS) system introduces dynamic reconfigurability in element positions or radiation patterns, overcoming the limitations of traditional RISs to provide enhanced beamforming flexibility and adaptability, with this paper offering a comprehensive overview of its current developments and future prospects.

Authors:Junsong Li, Jie Zhou, Bihao Zhan, Yutao Yang, Qianjun Pan, Shilian Chen, Tianyu Huai, Xin Li, Qin Chen, Liang He
Title: LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization
Abstract:
Alignment plays a crucial role in Large Language Models (LLMs) in aligning with human preferences on a specific task/domain. Traditional alignment methods suffer from catastrophic forgetting, where models lose previously acquired knowledge when adapting to new preferences or domains. We introduce LifeAlign, a novel framework for lifelong alignment that enables LLMs to maintain consistent human preference alignment across sequential learning tasks without forgetting previously learned knowledge. Our approach consists of two key innovations. First, we propose a focalized preference optimization strategy that aligns LLMs with new preferences while preventing the erosion of knowledge acquired from previous tasks. Second, we develop a short-to-long memory consolidation mechanism that merges denoised short-term preference representations into stable long-term memory using intrinsic dimensionality reduction, enabling efficient storage and retrieval of alignment patterns across diverse domains. We evaluate LifeAlign across multiple sequential alignment tasks spanning different domains and preference types. Experimental results demonstrate that our method achieves superior performance in maintaining both preference alignment quality and knowledge retention compared to existing lifelong learning approaches. The codes and datasets will be released on GitHub.
中文摘要:LifeAlign是一个终身对齐框架,通过聚焦偏好优化和短长期记忆整合机制,使大语言模型在持续学习过程中既能适应新任务偏好,又能保持已学知识不遗忘。
English Summary: LifeAlign is a lifelong alignment framework that prevents catastrophic forgetting in LLMs through focalized preference optimization and memory consolidation, enabling consistent human preference alignment across sequential tasks while preserving previously acquired knowledge.

Authors:Zhentian Zhang, Kai-Kit Wong, David Morales-Jimenez, Hao Jiang, Hao Xu, Christos Masouros, Zaichen Zhang, Chan-Byoung Chae
Title: Finite-blocklength Fluid Antenna Systems
Abstract:
This work introduces and investigates finite blocklength fluid antenna systems (FBL-FASs). To meet the stringent key performance indicators (KPIs) of 6G and beyond networks, including ultra-massive machine-type communications (mMTC), ultra-reliable low-latency communications (URLLC), and enhanced mobile broadband (eMBB), it is necessary to evaluate the performance of FAS under limited channel uses across time, frequency, and other domains. By exploiting random matrix theory and extreme value theory (EVT), we characterize the effect of finite blocklength on key metrics such as the signal-to-noise ratio (SNR) and the signal-to-interference-plus-noise ratio (SINR), via accurate estimation of interference caused by codeword correlation. Closed-form expressions for block error rate (BLER) and outage probability are derived, covering both conditional BLER (with channel state information, CSI) and statistical BLER (without CSI). The proposed analysis leverages Chernoff bounds and introduces a Taylor-expansion-assisted mean value theorem for integrals (MVTI) to reduce computational complexity. Numerical results show that, compared with conventional multi-antenna systems, the proposed FBL-FAS framework achieves higher energy and spectral efficiency under finite blocklength, making it a promising enabler for next-generation wireless networks.
本研究介绍了有限块长流体天线系统(FBL-FAS),并通过先进数学理论分析其性能,证明其在下一代无线网络中相比传统多天线系统具有更高的能量和频谱效率。
This study introduces finite blocklength fluid antenna systems (FBL-FAS) and analyzes their performance using advanced mathematical theories, demonstrating superior energy and spectral efficiency over traditional multi-antenna systems for next-generation wireless networks.

Authors:Tong Zhang, Qianren Li, Shuai Wang, Wanli Ni, Jiliang Zhang, Rui Wang, Kai-Kit Wong, Chan-Byoung Chae
Title: Indoor Fluid Antenna Systems Enabled by Layout-Specific Modeling and Group Relative Policy Optimization
Abstract:
The fluid antenna system (FAS) revolutionizes wireless communications by employing position-flexible antennas that dynamically optimize channel conditions and mitigate multipath fading. This innovation is particularly valuable in indoor environments, where signal propagation is severely degraded due to structural obstructions and complex multipath reflections. In this paper, we study the channel modeling and joint optimization of antenna positioning, beamforming, and power allocation for indoor FAS. In particular, we propose, for the first time, a layout-specific channel model and a novel group relative policy optimization (GRPO) algorithm for indoor FAS. Compared to the state-of-the-art Sionna model, our approach achieves an $83.3\%$ reduction in computation time with an approximately $3$ dB increase in root-mean-square error (RMSE). When simplified to a two-ray model, our channel model enables a closed-form solution for the optimal antenna position, achieving near-optimal performance. {For the joint optimization problem, the proposed GRPO algorithm outperforms proximal policy optimization (PPO) and other baselines in sum-rate, while requiring only 49.2\% computational resources of PPO, due to its group-based advantage estimation.} Simulation results reveal that increasing either the group size or trajectory length in GRPO does not yield significant improvements in sum-rate, suggesting that these parameters can be selected conservatively without sacrificing performance.
Chinese Summary: 本文针对室内流体天线系统提出了布局特定的信道模型和新型GRPO算法,在联合优化天线定位、波束成形和功率分配方面实现了显著的计算效率与接近最优的性能。
English Summary: The paper introduces a layout-specific channel model and a novel GRPO algorithm for indoor fluid antenna systems, achieving significant computational efficiency and near-optimal performance in joint optimization of antenna positioning, beamforming, and power allocation.

Authors:Monica Sekoyan, Nithin Rao Koluguri, Nune Tadevosyan, Piotr Zelasko, Travis Bartley, Nikolay Karpov, Jagadeesh Balam, Boris Ginsburg
Title: Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST
Abstract:
This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). Built with a FastConformer encoder and Transformer decoder, it supports 25 languages primarily European. The model was trained on 1.7M hours of total data samples, including Granary and NeMo ASR Set 3.0, with non-speech audio added to reduce hallucinations for ASR and AST. We describe its two-stage pre-training and fine-tuning process with dynamic data balancing, as well as experiments with an nGPT encoder. Results show nGPT scales well with massive data, while FastConformer excels after fine-tuning. For timestamps, Canary-1B-v2 uses the NeMo Forced Aligner (NFA) with an auxiliary CTC model, providing reliable segment-level timestamps for ASR and AST. Evaluations show Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large and LLM-based systems. We also release Parakeet-TDT-0.6B-v3, a successor to v2, offering multilingual ASR across the same 25 languages with just 600M parameters.
Chinese: Canary-1B-v2是一款快速、鲁棒的多语言语音识别与翻译模型,在英语ASR上超越Whisper-large-v3且快10倍,并在25种语言中与更大模型相比展现出竞争力。
English: Canary-1B-v2 is a fast and robust multilingual model for speech recognition and translation, outperforming Whisper-large-v3 in English ASR with 10x speed and delivering competitive results against larger models across 25 languages.

Authors:Zhaoyang Chu, Yao Wan, Zhikun Zhang, Di Wang, Zhou Yang, Hongyu Zhang, Pan Zhou, Xuanhua Shi, Hai Jin, David Lo
Title: Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning
Abstract:
While Code Language Models (CLMs) have demonstrated superior performance in software engineering tasks such as code generation and summarization, recent empirical studies reveal a critical privacy vulnerability: these models exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted. To address this issue, several approaches, including training data de-duplication and differential privacy augmentation, have been proposed. However, these methods require full-model retraining for deployed CLMs, which incurs substantial computational costs. In this paper, we aim to answer the following research question: Can sensitive information memorized by CLMs be erased effectively and efficiently? We conduct a pioneering investigation into erasing sensitive memorization in CLMs through machine unlearning - a post-hoc modification method that removes specific information from trained models without requiring full retraining. Specifically, we first quantify the memorization risks of sensitive data within CLM training datasets and curate a high-risk dataset of 50,000 sensitive memorized samples as unlearning targets. We study two widely used gradient ascent-based unlearning approaches: the vanilla and constraint-based methods, and introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code. Extensive experiments on three families of CLMs, i.e., CodeParrot, CodeGen-Mono, and Qwen2.5-Coder, validate the effectiveness and efficiency of CodeEraser in erasing targeted sensitive memorization while maintaining model utility.
中文摘要:代码语言模型存在记忆敏感训练数据的风险,而现有隐私保护方法需全模型重训练;本文提出CodeEraser这一高效机器遗忘方法,能选择性消除敏感记忆同时保持模型功能完整性。
English Summary: Code Language Models (CLMs) risk memorizing sensitive training data, but existing privacy solutions require costly full-model retraining; this paper introduces CodeEraser, an efficient machine unlearning method that selectively removes sensitive memorization while preserving model functionality.

Authors:Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He
Title: Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning
Abstract:
Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges due to accumulating diverse information. Traditional privacy methods, like a uniform Differential Privacy (DP) budget, indiscriminately protect all data, leading to substantial model utility degradation and hindering CL deployment in privacy-sensitive areas. To overcome this, we propose a privacy-enhanced continual learning (PeCL) framework that forgets what's sensitive and remembers what matters. Our approach first introduces a token-level dynamic Differential Privacy strategy that adaptively allocates privacy budgets based on the semantic sensitivity of individual tokens. This ensures robust protection for private entities while minimizing noise injection for non-sensitive, general knowledge. Second, we integrate a privacy-guided memory sculpting module. This module leverages the sensitivity analysis from our dynamic DP mechanism to intelligently forget sensitive information from the model's memory and parameters, while explicitly preserving the task-invariant historical knowledge crucial for mitigating catastrophic forgetting. Extensive experiments show that PeCL achieves a superior balance between privacy preserving and model utility, outperforming baseline models by maintaining high accuracy on previous tasks while ensuring robust privacy.
中文: 提出的隐私增强持续学习(PeCL)框架通过引入令牌级动态差分隐私策略和隐私引导记忆优化模块,在实验中实现了敏感信息保护与知识保留的最佳平衡,显著提升了隐私安全与模型效用的协同表现。
English: The proposed privacy-enhanced continual learning (PeCL) framework introduces a token-level dynamic differential privacy strategy and a privacy-guided memory sculpting module to effectively protect sensitive information while preserving essential knowledge, achieving superior privacy-utility balance in experiments.

Authors:Shilian Chen, Jie Zhou, Tianyu Huai, Yujiang Lu, Junsong Li, Bihao Zhan, Qianjun Pan, Yutao Yang, Xin Li, Qin Chen, Hang Yan, Liang He
Title: Black-box Model Merging for Language-Model-as-a-Service with Massive Model Repositories
Abstract:
Model merging refers to the process of integrating multiple distinct models into a unified model that preserves and combines the strengths and capabilities of the individual models. Most existing approaches rely on task vectors to combine models, typically under the assumption that model parameters are accessible. However, for extremely large language models (LLMs) such as GPT-4, which are often provided solely as black-box services through API interfaces (Language-Model-as-a-Service), model weights are not available to end users. This presents a significant challenge, which we refer to as black-box model merging (BMM) with massive LLMs. To address this challenge, we propose a derivative-free optimization framework based on the evolutionary algorithm (Evo-Merging) that enables effective model merging using only inference-time API queries. Our method consists of two key components: (1) sparsity-based denoising, designed to identify and filter out irrelevant or redundant information across models, and (2) sign-aware scaling, which dynamically computes optimal combination weights for the relevant models based on their performance. We also provide a formal justification, along with a theoretical analysis, for our asymmetric sparsification. Extensive experimental evaluations demonstrate that our approach achieves state-of-the-art results on a range of tasks, significantly outperforming existing strong baselines.
中文摘要:本文提出Evo-Merging框架,通过基于进化算法的无导数优化方法,仅利用推理API即可实现黑盒大语言模型的融合,其稀疏化去噪和符号感知缩放技术在多任务中取得了最优性能。
English Summary: This paper introduces Evo-Merging, a derivative-free optimization framework using evolutionary algorithms to merge black-box large language models through API queries alone, featuring sparsity-based denoising and sign-aware scaling to achieve state-of-the-art performance across various tasks.

Authors:Yougen Zhou, Qin Chen, Ningning Zhou, Jie Zhou, Xingjiao Wu, Liang He
Title: Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations
Abstract:
Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not well been studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.
中文: 本研究揭示了大型语言模型在情感支持对话策略规划中存在偏好偏差的根本原因在于其知识边界限制,并提出了一种基于双重奖励的强化学习方法,通过优化策略规划的准确性和置信度有效缓解了该偏差,实验结果表明该方法优于现有基线模型。
English: This study identifies the knowledge boundaries of large language models as the root cause of their preference bias in emotional support conversation strategy planning and proposes a dual-reward reinforcement learning approach that effectively mitigates this bias by optimizing both accuracy and confidence, with experimental results demonstrating superior performance over existing methods.

Authors:Timothy Rupprecht, Enfu Nan, Arash Akbari, Arman Akbari, Lei Lu, Priyanka Maan, Sean Duffy, Pu Zhao, Yumei He, David Kaeli, Yanzhi Wang
Title: RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing
Abstract:
Role-playing Large language models (LLMs) are increasingly deployed in high-stakes domains such as healthcare, education, and governance, where failures can directly impact user trust and well-being. A cost effective paradigm for LLM role-playing is few-shot learning, but existing approaches often cause models to break character in unexpected and potentially harmful ways, especially when interacting with hostile users. Inspired by Retrieval-Augmented Generation (RAG), we reformulate LLM role-playing into a text retrieval problem and propose a new prompting framework called RAGs-to-Riches, which leverages curated reference demonstrations to condition LLM responses. We evaluate our framework with LLM-as-a-judge preference voting and introduce two novel token-level ROUGE metrics: Intersection over Output (IOO) to quantity how much an LLM improvises and Intersection over References (IOR) to measure few-shot demonstrations utilization rate during the evaluation tasks. When simulating interactions with a hostile user, our prompting strategy incorporates in its responses during inference an average of 35% more tokens from the reference demonstrations. As a result, across 453 role-playing interactions, our models are consistently judged as being more authentic, and remain in-character more often than zero-shot and in-context Learning (ICL) methods. Our method presents a scalable strategy for building robust, human-aligned LLM role-playing frameworks.
中文摘要:RAGs-to-Riches框架通过整合精选参考示范来增强大语言模型的角色扮演能力,相比传统方法,在对抗性交互中能保持更高角色一致性和真实性。
English Summary: The proposed RAGs-to-Riches framework enhances LLM role-playing by incorporating curated reference demonstrations, resulting in more authentic character portrayal and improved resilience against hostile interactions compared to standard methods.

Authors:Jiaxuan Zhao, Naibin Gu, Yuchen Feng, Xiyu Liu, Peng Fu, Zheng Lin, Weiping Wang
Title: CBP-Tuning: Efficient Local Customization for Black-box Large Language Models
Abstract:
The high costs of customizing large language models (LLMs) fundamentally limit their adaptability to user-specific needs. Consequently, LLMs are increasingly offered as cloud-based services, a paradigm that introduces critical limitations: providers struggle to support personalized customization at scale, while users face privacy risks when exposing sensitive data. To address this dual challenge, we propose Customized Black-box Prompt Tuning (CBP-Tuning), a novel framework that facilitates efficient local customization while preserving bidirectional privacy. Specifically, we design a two-stage framework: (1) a prompt generator trained on the server-side to capture domain-specific and task-agnostic capabilities, and (2) user-side gradient-free optimization that tailors soft prompts for individual tasks. This approach eliminates the need for users to access model weights or upload private data, requiring only a single customized vector per task while achieving effective adaptation. Furthermore, the evaluation of CBP-Tuning in the commonsense reasoning, medical and financial domain settings demonstrates superior performance compared to baselines, showcasing its advantages in task-agnostic processing and privacy preservation.
中文: 针对大语言模型定制成本高和隐私风险的问题,定制化黑箱提示调优框架通过两阶段设计实现高效本地适配,无需用户共享数据或模型参数,在多项领域测试中展现出优越性能。
English: To overcome the high costs and privacy concerns of customizing large language models, the Customized Black-box Prompt Tuning framework enables efficient, private local customization through a two-stage process that avoids exposing user data or model weights.

Authors:Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, Yanzhi Wang
Title: Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs
Abstract:
Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic control, yet their performance scaling across model architectures and hardware platforms, as well as their associated power budgets, remain poorly understood. This work presents an evaluation of five representative VLA models -- spanning state-of-the-art baselines and two newly proposed architectures -- targeting edge and datacenter GPU platforms. Using the LIBERO benchmark, we measure accuracy alongside system-level metrics, including latency, throughput, and peak memory usage, under varying edge power constraints and high-performance datacenter GPU configurations. Our results identify distinct scaling trends: (1) architectural choices, such as action tokenization and model backbone size, strongly influence throughput and memory footprint; (2) power-constrained edge devices exhibit non-linear performance degradation, with some configurations matching or exceeding older datacenter GPUs; and (3) high-throughput variants can be achieved without significant accuracy loss. These findings provide actionable insights when selecting and optimizing VLAs across a range of deployment constraints. Our work challenges current assumptions about the superiority of datacenter hardware for robotic inference.
中文: 本研究评估了五种视觉-语言-动作模型,发现架构选择和功耗限制显著影响性能,部分边缘设备在保持精度的同时可媲美数据中心GPU。
English: This study evaluates five Vision-Language-Action models, revealing that architectural choices and power constraints significantly impact performance, with some edge devices rivaling datacenter GPUs while maintaining accuracy.

Authors:Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen
Title: Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications
Abstract:
Large Language Models (LLMs) have demonstrated significant potential in medicine. To date, LLMs have been widely applied to tasks such as diagnostic assistance, medical question answering, and clinical information synthesis. However, a key open question remains: to what extent do LLMs memorize medical training data. In this study, we present the first comprehensive evaluation of memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than reported in the general domain. Memorization affects both the development and adoption of LLMs in medicine and can be categorized into three types: beneficial (e.g., accurate recall of clinical guidelines and biomedical references), uninformative (e.g., repeated disclaimers or templated medical document language), and harmful (e.g., regeneration of dataset-specific or sensitive clinical content). Based on these findings, we offer practical recommendations to facilitate beneficial memorization that enhances domain-specific reasoning and factual accuracy, minimize uninformative memorization to promote deeper learning beyond surface-level patterns, and mitigate harmful memorization to prevent the leakage of sensitive or identifiable patient information.
中文: 本研究首次全面评估医学大语言模型的记忆现象,揭示其在各类适应场景中普遍存在,并将记忆影响分为有益、无意义和有害三类,同时提出了优化记忆效果的实用建议。
English: This study provides the first comprehensive evaluation of memorization in medical Large Language Models, revealing its prevalence across adaptation scenarios and categorizing its impacts as beneficial, uninformative, or harmful, with practical recommendations offered for optimization.

Authors:Chuang Jiang, Mingyue Cheng, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu
Title: TableMind: An Autonomous Programmatic Agent for Tool-Augmented Table Reasoning
Abstract:
Table reasoning is crucial for leveraging structured data in domains such as finance, healthcare, and scientific research. While large language models (LLMs) show promise in multi-step reasoning, purely text-based methods often struggle with the complex numerical computations and fine-grained operations inherently required in this task. Tool-integrated reasoning improves computational accuracy via explicit code execution, yet existing systems frequently rely on rigid patterns, supervised imitation, and lack true autonomous adaptability. In this paper, we present TableMind, an LLM-driven table reasoning agent that (i) autonomously performs multi-turn tool invocation, (ii) writes and executes data-analyzing code in a secure sandbox environment for data analysis and precise numerical reasoning, and (iii) exhibits high-level capabilities such as planning and self-reflection to adapt strategies. To realize these capabilities, we adopt a two-stage fine-tuning paradigm built on top of a powerful pre-trained language model: supervised fine-tuning on high-quality reasoning trajectories to establish effective tool usage patterns, followed by reinforcement fine-tuning to optimize multi-objective strategies. In particular, we propose Rank-Aware Policy Optimization (RAPO), which increases the update weight of high-quality trajectories when their output probabilities are lower than those of low-quality ones, thereby guiding the model more consistently toward better and more accurate answers. Extensive experiments on several mainstream benchmarks demonstrate that TableMind achieves superior performance compared to competitive baselines, yielding substantial gains in both reasoning accuracy and computational precision.
中文摘要:TableMind是一种基于大语言模型的智能表格推理代理,通过自主工具调用、安全沙箱代码执行及策略自适应,结合两阶段微调方法显著提升了结构化数据推理的准确性和计算精度。
English Summary: TableMind is an advanced LLM-driven agent that autonomously performs multi-step table reasoning through tool integration, code execution, and strategic planning, achieving superior accuracy via a novel two-stage fine-tuning approach.

Authors:Yiming Yao, Fei Liu, Liang Zhao, Xi Lin, Qingfu Zhang
Title: FoMEMO: Towards Foundation Models for Expensive Multi-objective Optimization
Abstract:
Expensive multi-objective optimization is a prevalent and crucial concern in many real-world scenarios, where sample-efficiency is vital due to the limited evaluations to recover the true Pareto front for decision making. Existing works either involve rebuilding Gaussian process surrogates from scratch for each objective in each new problem encountered, or rely on extensive past domain experiments for pre-training deep learning models, making them hard to generalize and impractical to cope with various emerging applications in the real world. To address this issue, we propose a new paradigm named FoMEMO (Foundation Models for Expensive Multi-objective Optimization), which enables the establishment of a foundation model conditioned on any domain trajectory and user preference, and facilitates fast in-context optimization based on the predicted preference-wise aggregation posteriors. Rather than accessing extensive domain experiments in the real world, we demonstrate that pre-training the foundation model with a diverse set of hundreds of millions of synthetic data can lead to superior adaptability to unknown problems, without necessitating any subsequent model training or updates in the optimization process. We evaluate our method across a variety of synthetic benchmarks and real-word applications, and demonstrate its superior generality and competitive performance compared to existing methods.
中文: 提出的FoMEMO范式利用基于海量合成数据预训练的基础模型,无需领域实验或模型更新即可实现高效、自适应的多目标优化,在各类基准测试和实际应用中展现出卓越的泛化能力和竞争优势。
English: The proposed FoMEMO paradigm utilizes a foundation model pre-trained on extensive synthetic data to enable efficient and adaptable multi-objective optimization without requiring domain-specific experiments or model updates, demonstrating superior generality and performance across various benchmarks and real-world applications.

Authors:Yougen Zhou, Ningning Zhou, Qin Chen, Jie Zhou, Aimin Zhou, Liang He
Title: DiaCBT: A Long-Periodic Dialogue Corpus Guided by Cognitive Conceptualization Diagram for CBT-based Psychological Counseling
Abstract:
Psychotherapy reaches only a small fraction of individuals suffering from mental disorders due to social stigma and the limited availability of therapists. Large language models (LLMs), when equipped with professional psychotherapeutic skills, offer a promising solution to expand access to mental health services. However, the lack of psychological conversation datasets presents significant challenges in developing effective psychotherapy-guided conversational agents. In this paper, we construct a long-periodic dialogue corpus for counseling based on cognitive behavioral therapy (CBT). Our curated dataset includes multiple sessions for each counseling and incorporates cognitive conceptualization diagrams (CCDs) to guide client simulation across diverse scenarios. To evaluate the utility of our dataset, we train an in-depth counseling model and present a comprehensive evaluation framework to benchmark it against established psychological criteria for CBT-based counseling. Results demonstrate that DiaCBT effectively enhances LLMs' ability to emulate psychologists with CBT expertise, underscoring its potential for training more professional counseling agents.
中文摘要:由于社会污名和治疗师资源有限,心理治疗仅能惠及少数精神障碍患者,而基于认知行为疗法构建的对话数据集DiaCBT能有效增强大语言模型的专业心理咨询能力,为拓展心理健康服务提供新途径。
English Summary: Psychotherapy access is limited by stigma and therapist shortages, but large language models trained with cognitive behavioral therapy datasets like DiaCBT show promise in expanding mental health services by effectively emulating professional counseling.

Authors:Guyue Hu, Siyuan Song, Jingpeng Sun, Zhe Jin, Chenglong Li, Jin Tang
Title: Mix-modal Federated Learning for MRI Image Segmentation
Abstract:
Magnetic resonance imaging (MRI) image segmentation is crucial in diagnosing and treating many diseases, such as brain tumors. Existing MRI image segmentation methods mainly fall into a centralized multimodal paradigm, which is inapplicable in engineering non-centralized mix-modal medical scenarios. In this situation, each distributed client (hospital) processes multiple mixed MRI modalities, and the modality set and image data for each client are diverse, suffering from extensive client-wise modality heterogeneity and data heterogeneity. In this paper, we first formulate non-centralized mix-modal MRI image segmentation as a new paradigm for federated learning (FL) that involves multiple modalities, called mix-modal federated learning (MixMFL). It distinguishes from existing multimodal federating learning (MulMFL) and cross-modal federating learning (CroMFL) paradigms. Then, we proposed a novel modality decoupling and memorizing mix-modal federated learning framework (MDM-MixMFL) for MRI image segmentation, which is characterized by a modality decoupling strategy and a modality memorizing mechanism. Specifically, the modality decoupling strategy disentangles each modality into modality-tailored and modality-shared information. During mix-modal federated updating, corresponding modality encoders undergo tailored and shared updating, respectively. It facilitates stable and adaptive federating aggregation of heterogeneous data and modalities from distributed clients. Besides, the modality memorizing mechanism stores client-shared modality prototypes dynamically refreshed from every modality-tailored encoder to compensate for incomplete modalities in each local client. It further benefits modality aggregation and fusion processes during mixmodal federated learning. Extensive experiments on two public datasets for MRI image segmentation demonstrate the effectiveness and superiority of our methods.
中文: 本文针对分布式混合模态MRI分割中的客户端异质性问题,提出了新型混合模态联邦学习范式及模态解耦记忆框架,通过模态定制化处理和动态原型补偿机制,在实验中展现出卓越的有效性和优越性。
English: This paper introduces a novel mix-modal federated learning (MixMFL) paradigm and a modality decoupling and memorizing framework (MDM-MixMFL) to address client-wise heterogeneity in non-centralized MRI segmentation, demonstrating superior performance through tailored modality processing and dynamic prototype compensation.

Authors:Junda He, Zhou Yang, Jieke Shi, Chengran Yang, Kisub Kim, Bowen Xu, Xin Zhou, David Lo
Title: Curiosity-Driven Testing for Sequential Decision-Making Process
Abstract:
Sequential decision-making processes (SDPs) are fundamental for complex real-world challenges, such as autonomous driving, robotic control, and traffic management. While recent advances in Deep Learning (DL) have led to mature solutions for solving these complex problems, SDMs remain vulnerable to learning unsafe behaviors, posing significant risks in safety-critical applications. However, developing a testing framework for SDMs that can identify a diverse set of crash-triggering scenarios remains an open challenge. To address this, we propose CureFuzz, a novel curiosity-driven black-box fuzz testing approach for SDMs. CureFuzz proposes a curiosity mechanism that allows a fuzzer to effectively explore novel and diverse scenarios, leading to improved detection of crashtriggering scenarios. Additionally, we introduce a multi-objective seed selection technique to balance the exploration of novel scenarios and the generation of crash-triggering scenarios, thereby optimizing the fuzzing process. We evaluate CureFuzz on various SDMs and experimental results demonstrate that CureFuzz outperforms the state-of-the-art method by a substantial margin in the total number of faults and distinct types of crash-triggering scenarios. We also demonstrate that the crash-triggering scenarios found by CureFuzz can repair SDMs, highlighting CureFuzz as a valuable tool for testing SDMs and optimizing their performance.
中文:CureFuzz提出了一种基于好奇心的模糊测试方法,用于顺序决策系统,能更有效地发现多样化的崩溃场景,在故障识别和模型修复方面显著优于现有技术。
English: CureFuzz introduces a curiosity-driven fuzz testing approach for sequential decision-making systems, enhancing the detection of diverse crash scenarios and outperforming existing methods in fault identification and model repair.

Authors:Jiawei Cao, Jie Ouyang, Zhaomeng Zhou, Mingyue Cheng, Yupeng Li, Jiaxian Yan, Qi Liu
Title: Re3: Learning to Balance Relevance & Recency for Temporal Information Retrieval
Abstract:
Temporal Information Retrieval (TIR) is a critical yet unresolved task for modern search systems, retrieving documents that not only satisfy a query's information need but also adhere to its temporal constraints. This task is shaped by two challenges: Relevance, ensuring alignment with the query's explicit temporal requirements, and Recency, selecting the freshest document among multiple versions. Existing methods often address the two challenges in isolation, relying on brittle heuristics that fail in scenarios where temporal requirements and staleness resistance are intertwined. To address this gap, we introduce Re2Bench, a benchmark specifically designed to disentangle and evaluate Relevance, Recency, and their hybrid combination. Building on this foundation, we propose Re3, a unified and lightweight framework that dynamically balances semantic and temporal information through a query-aware gating mechanism. On Re2Bench, Re3 achieves state-of-the-art results, leading in R@1 across all three subsets. Ablation studies with backbone sensitivity tests confirm robustness, showing strong generalization across diverse encoders and real-world settings. This work provides both a generalizable solution and a principled evaluation suite, advancing the development of temporally aware retrieval systems. Re3 and Re2Bench are available online: https://anonymous.4open.science/r/Re3-0C5A
中文: 本研究提出了Re2Bench基准,用于评估信息检索中的时间相关性和时效性,并设计了Re3统一框架,通过动态平衡语义和时间信息实现了最优性能。
English: The study introduces Re2Bench, a benchmark for evaluating temporal relevance and recency in information retrieval, and proposes Re3, a unified framework that dynamically balances semantic and temporal information to achieve state-of-the-art performance.

Authors:Aditya Kasliwal, Franziska Boenisch, Adam Dziedzic
Title: Localizing and Mitigating Memorization in Image Autoregressive Models
Abstract:
Image AutoRegressive (IAR) models have achieved state-of-the-art performance in speed and quality of generated images. However, they also raise concerns about memorization of their training data and its implications for privacy. This work explores where and how such memorization occurs within different image autoregressive architectures by measuring a fine-grained memorization. The analysis reveals that memorization patterns differ across various architectures of IARs. In hierarchical per-resolution architectures, it tends to emerge early and deepen with resolutions, while in IARs with standard autoregressive per token prediction, it concentrates in later processing stages. These localization of memorization patterns are further connected to IARs' ability to memorize and leak training data. By intervening on their most memorizing components, we significantly reduce the capacity for data extraction from IARs with minimal impact on the quality of generated images. These findings offer new insights into the internal behavior of image generative models and point toward practical strategies for mitigating privacy risks.
Chinese: 本研究探讨了不同图像自回归架构中训练数据记忆的发生机制,揭示了各模型间不同的记忆模式,并通过针对性干预在保持图像质量的同时有效降低了数据泄露风险。
English: This study investigates how memorization of training data occurs in different Image AutoRegressive (IAR) architectures, revealing distinct patterns across models and enabling targeted interventions that reduce data extraction risks while preserving image quality.

Authors:Yushuo Chen, Ruizhi Shao, Youxin Pang, Hongwen Zhang, Xinyi Wu, Rihui Wu, Yebin Liu
Title: DevilSight: Augmenting Monocular Human Avatar Reconstruction through a Virtual Perspective
Abstract:
We present a novel framework to reconstruct human avatars from monocular videos. Recent approaches have struggled either to capture the fine-grained dynamic details from the input or to generate plausible details at novel viewpoints, which mainly stem from the limited representational capacity of the avatar model and insufficient observational data. To overcome these challenges, we propose to leverage the advanced video generative model, Human4DiT, to generate the human motions from alternative perspective as an additional supervision signal. This approach not only enriches the details in previously unseen regions but also effectively regularizes the avatar representation to mitigate artifacts. Furthermore, we introduce two complementary strategies to enhance video generation: To ensure consistent reproduction of human motion, we inject the physical identity into the model through video fine-tuning. For higher-resolution outputs with finer details, a patch-based denoising algorithm is employed. Experimental results demonstrate that our method outperforms recent state-of-the-art approaches and validate the effectiveness of our proposed strategies.
中文: 本文提出一种新颖框架,通过利用Human4DiT视频生成模型从不同视角生成人体运动作为额外监督信号,结合物理身份注入和基于块的去噪策略,有效增强单目视频中人体虚拟形象的细节重建并减少伪影。
English: This paper introduces a novel framework that reconstructs human avatars from monocular videos by leveraging the Human4DiT video generative model to generate motions from alternative perspectives, enhancing detail and reducing artifacts through additional supervision and complementary strategies like physical identity injection and patch-based denoising.

Authors:Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang
Title: OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing
Abstract:
The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18\% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.
中文: OpenGPT-4o-Image数据集通过分层任务分类和自动生成8万对多样化指令-图像对,解决了多模态训练数据的系统性不足,显著提升了模型在编辑和生成任务中的性能表现。
English: The OpenGPT-4o-Image dataset addresses limitations in multimodal training data by introducing a hierarchical taxonomy and automated generation of 80k diverse instruction-image pairs, significantly boosting model performance in both editing and generation tasks.

Authors:Yixuan Li, Xinyi Liu, Weidong Yang, Ben Fei, Shuhao Li, Mingjie Zhou, Lipeng Ma
Title: PseudoBridge: Pseudo Code as the Bridge for Better Semantic and Logic Alignment in Code Retrieval
Abstract:
Code search aims to precisely find relevant code snippets that match natural language queries within massive codebases, playing a vital role in software development. Recent advances leverage pre-trained language models (PLMs) to bridge the semantic gap between unstructured natural language (NL) and structured programming languages (PL), yielding significant improvements over traditional information retrieval and early deep learning approaches. However, existing PLM-based methods still encounter key challenges, including a fundamental semantic gap between human intent and machine execution logic, as well as limited robustness to diverse code styles. To address these issues, we propose PseudoBridge, a novel code retrieval framework that introduces pseudo-code as an intermediate, semi-structured modality to better align NL semantics with PL logic. Specifically, PseudoBridge consists of two stages. First, we employ an advanced large language model (LLM) to synthesize pseudo-code, enabling explicit alignment between NL queries and pseudo-code. Second, we introduce a logic-invariant code style augmentation strategy and employ the LLM to generate stylistically diverse yet logically equivalent code implementations with pseudo-code, then align the code snippets of different styles with pseudo-code, enhancing model robustness to code style variation. We build PseudoBridge across 10 different PLMs and evaluate it on 6 mainstream programming languages. Extensive experiments demonstrate that PseudoBridge consistently outperforms baselines, achieving significant gains in retrieval accuracy and generalization, particularly under zero-shot domain transfer scenarios such as Solidity and XLCoST datasets. These results demonstrate the effectiveness of explicit logical alignment via pseudo-code and highlight PseudoBridge's potential as a robust, generalizable solution for code retrieval.
中文: PseudoBridge通过引入伪代码作为中间模态来弥合自然语言查询与编程语言之间的语义鸿沟,采用显式对齐和代码风格增强的两阶段方法,显著提升了多种编程语言中的检索准确性和鲁棒性。
English: PseudoBridge introduces pseudo-code as an intermediate modality to bridge the semantic gap between natural language queries and programming languages, employing a two-stage process of explicit alignment and code style augmentation to significantly enhance retrieval accuracy and robustness across diverse programming languages.

Authors:Lipeng Ma, Yixuan Li, Weidong Yang, Mingjie Zhou, Xinyi Liu, Ben Fei, Shuhao Li, Xiaoyan Sun, Sihang Jiang, Yanghua Xiao
Title: LogReasoner: Empowering LLMs with Expert-like Coarse-to-Fine Reasoning for Log Analysis Tasks
Abstract:
Log analysis is crucial for monitoring system health and diagnosing failures in complex systems. Recent advances in large language models (LLMs) offer new opportunities for automated log analysis, leveraging their reasoning capabilities to perform tasks such as anomaly detection and failure prediction. However, general-purpose LLMs struggle to formulate structured reasoning workflows that align with expert cognition and deliver precise details of reasoning steps. To address these challenges, we propose LogReasoner, a coarse-to-fine reasoning enhancement framework designed to enable LLMs to reason log analysis tasks like experts. LogReasoner consists of two stages: (1) coarse-grained enhancement of expert thinking, where high-level expert thoughts are constructed from collected troubleshooting flowcharts and existing tasks to enable LLMs to formulate structured reasoning workflows and (2) fine-grained enhancement of specific steps, where we first fine-tune the LLM with task-specific stepwise solutions to enhance the LLM for instantiated reasoning, then employ the preference learning to calibrate the LLM's reasoning details from its mistakes, further strengthen the LLM's analytical granularity and correctness. We evaluate LogReasoner on four distinct log analysis tasks using open-source LLMs such as Qwen-2.5 and Llama-3. Experimental results show that LogReasoner significantly outperforms existing LLMs, achieving state-of-the-art performance and demonstrating its effectiveness in enhancing the reasoning capabilities of LLMs for log analysis.
中文: LogReasoner通过从粗到精的推理增强框架,使大语言模型能够像专家一样分析日志,在多项任务中实现了最先进的性能表现。
English: LogReasoner is a novel framework that enhances large language models for expert-level log analysis through coarse-to-fine reasoning, achieving state-of-the-art performance across multiple tasks.

Authors:Lipeng Ma, Yixuan Li, Weidong Yang, Mingjie Zhou, Xinyi Liu, Ben Fei, Shuhao Li, Xiaoyan Sun, Sihang Jiang, Yanghua Xiao
Title: LogReasoner: Empowering LLMs with Expert-like Coarse-to-Fine Reasoning for Automated Log Analysis
Abstract:
Log analysis is crucial for monitoring system health and diagnosing failures in complex systems. Recent advances in large language models (LLMs) offer new opportunities for automated log analysis, leveraging their reasoning capabilities to perform tasks such as anomaly detection and failure prediction. However, general-purpose LLMs struggle to formulate structured reasoning workflows that align with expert cognition and deliver precise details of reasoning steps. To address these challenges, we propose LogReasoner, a coarse-to-fine reasoning enhancement framework designed to enable LLMs to reason log analysis tasks like experts. LogReasoner consists of two stages: (1) coarse-grained enhancement of expert thinking, where high-level expert thoughts are constructed from collected troubleshooting flowcharts and existing tasks to enable LLMs to formulate structured reasoning workflows and (2) fine-grained enhancement of specific steps, where we first fine-tune the LLM with task-specific stepwise solutions to enhance the LLM for instantiated reasoning, then employ the preference learning to calibrate the LLM's reasoning details from its mistakes, further strengthen the LLM's analytical granularity and correctness. We evaluate LogReasoner on four distinct log analysis tasks using open-source LLMs such as Qwen-2.5 and Llama-3. Experimental results show that LogReasoner significantly outperforms existing LLMs, achieving state-of-the-art performance and demonstrating its effectiveness in enhancing the reasoning capabilities of LLMs for log analysis.
中文: LogReasoner通过从粗到精的推理增强框架,使大语言模型能够像专家一样分析日志,在多项任务中实现了最先进的性能表现。
English: LogReasoner is a novel framework that enhances large language models for expert-level log analysis through coarse-to-fine reasoning, achieving state-of-the-art performance across multiple tasks.

Authors:Xinwei Zhang, Haibo Hu, Qingqing Ye, Li Bai, Huadi Zheng
Title: MER-Inspector: Assessing model extraction risks from an attack-agnostic perspective
Abstract:
Information leakage issues in machine learning-based Web applications have attracted increasing attention. While the risk of data privacy leakage has been rigorously analyzed, the theory of model function leakage, known as Model Extraction Attacks (MEAs), has not been well studied. In this paper, we are the first to understand MEAs theoretically from an attack-agnostic perspective and to propose analytical metrics for evaluating model extraction risks. By using the Neural Tangent Kernel (NTK) theory, we formulate the linearized MEA as a regularized kernel classification problem and then derive the fidelity gap and generalization error bounds of the attack performance. Based on these theoretical analyses, we propose a new theoretical metric called Model Recovery Complexity (MRC), which measures the distance of weight changes between the victim and surrogate models to quantify risk. Additionally, we find that victim model accuracy, which shows a strong positive correlation with model extraction risk, can serve as an empirical metric. By integrating these two metrics, we propose a framework, namely Model Extraction Risk Inspector (MER-Inspector), to compare the extraction risks of models under different model architectures by utilizing relative metric values. We conduct extensive experiments on 16 model architectures and 5 datasets. The experimental results demonstrate that the proposed metrics have a high correlation with model extraction risks, and MER-Inspector can accurately compare the extraction risks of any two models with up to 89.58%.
中文: 本文提出了评估机器学习中模型提取风险的理论框架,通过理论分析与实证指标相结合,经大量实验验证能有效比较不同模型架构的脆弱性。
English: This paper introduces a theoretical framework for assessing model extraction risks in machine learning, proposing both analytical and empirical metrics that are validated through extensive experiments to effectively compare vulnerabilities across different model architectures.

Authors:Fengyuan Liu, Rui Zhao, Shuo Chen, Guohao Li, Philip Torr, Lei Han, Jindong Gu
Title: Can an Individual Manipulate the Collective Decisions of Multi-Agents?
Abstract:
Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system's collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.
中文: 本研究提出M-Spoiler框架,通过模拟顽固智能体交互生成对抗样本,证明即使攻击者仅掌握单个智能体信息,也能误导多智能体系统的集体决策,揭示了系统安全风险。
English: This study introduces M-Spoiler, a framework that generates adversarial samples by simulating stubborn agent interactions to exploit vulnerabilities in multi-agent systems, demonstrating how knowledge of just one agent can mislead collective decisions despite incomplete information.

Authors:Yi-Fan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Haotian Wang, Kai Wu, Bo Cui, Xu Wang, Jianfei Pan, Haotian Wang, Zhang Zhang, Liang Wang
Title: BaseReward: A Strong Baseline for Multimodal Reward Model
Abstract:
The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear ``recipe'' for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including \textit{reward modeling paradigms} (e.g., Naive-RM, Critic-based RM, and Generative RM), \textit{reward head architecture}, \textit{training strategies}, \textit{data curation} (covering over ten multimodal and text-only preference datasets), \textit{backbone model} and \textit{model scale}, and \textit{ensemble methods}. Based on these experimental insights, we introduce \textbf{BaseReward}, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a {Qwen2.5-VL} backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically-backed guide for developing robust reward models for the next generation of MLLMs.
中文: 本文通过研究多模态奖励模型开发中的关键组件,提供了构建高性能模型的系统性指南,并推出了BaseReward这一新型基准模型,该模型在各项基准测试和实际应用中均展现出最先进的性能表现。
English: This paper provides a systematic guide for constructing high-performance Multimodal Reward Models (MRMs) by investigating key development components and introduces BaseReward, a new state-of-the-art baseline that demonstrates superior performance on benchmarks and practical applications.

Authors:Jihua Peng, Qianxiong Xu, Yichen Liu, Chenxi Liu, Cheng Long, Rui Zhao, Ziyue Li
Title: Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model
Abstract:
Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via Multimodal Large Language Model (MLLM). Our approach expand the original vocabulary of MLLM by introducing an activity-level token and multiple cluster-specific tokens. We process video frames alongside two specially designed tokens and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the token and tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. Also, we introduce a multi-label classification loss to further enhance the token's ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates MLLM's hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing the performance of GAD. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method in GAD taks.
中文: 提出的LIR-GAD框架通过融合多模态大语言模型与专用标记及视觉特征,增强了群体活动检测的性能和可解释性,实现了更精准的上下文推理。
English: The proposed LIR-GAD framework enhances group activity detection by integrating multimodal large language models with specialized tokens and visual features, improving both performance and explainability through contextual reasoning.

Authors:Sungho Lee, Marco Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji
Title: Reverse Engineering of Music Mixing Graphs with Differentiable Processors and Iterative Pruning
Abstract:
Reverse engineering of music mixes aims to uncover how dry source signals are processed and combined to produce a final mix. We extend the prior works to reflect the compositional nature of mixing and search for a graph of audio processors. First, we construct a mixing console, applying all available processors to every track and subgroup. With differentiable processor implementations, we optimize their parameters with gradient descent. Then, we repeat the process of removing negligible processors and fine-tuning the remaining ones. This way, the quality of the full mixing console can be preserved while removing approximately two-thirds of the processors. The proposed method can be used not only to analyze individual music mixes but also to collect large-scale graph data that can be used for downstream tasks, e.g., automatic mixing. Especially for the latter purpose, efficient implementation of the search is crucial. To this end, we present an efficient batch-processing method that computes multiple processors in parallel. We also exploit the "dry/wet" parameter of the processors to accelerate the search. Extensive quantitative and qualitative analyses are conducted to evaluate the proposed method's performance, behavior, and computational cost.
中文: 本研究通过构建带可微分处理器的混音台,采用梯度下降优化参数并高效移除冗余处理器,在保持音质的同时推进了音乐混音逆向工程,既能分析混音作品又能为自动混音等下游任务收集大规模数据。
English: This study advances reverse engineering of music mixes by constructing a mixing console with differentiable processors, optimizing parameters through gradient descent, and efficiently removing unnecessary processors while preserving quality, enabling both mix analysis and large-scale data collection for automatic mixing applications.

Authors:Guorui Chen, Yifan Xia, Xiaojun Jia, Zhijiang Li, Philip Torr, Jindong Gu
Title: LLM Jailbreak Detection for (Almost) Free!
Abstract:
Large language models (LLMs) enhance security through alignment when widely used, but remain susceptible to jailbreak attacks capable of producing inappropriate content. Jailbreak detection methods show promise in mitigating jailbreak attacks through the assistance of other models or multiple model inferences. However, existing methods entail significant computational costs. In this paper, we first present a finding that the difference in output distributions between jailbreak and benign prompts can be employed for detecting jailbreak prompts. Based on this finding, we propose a Free Jailbreak Detection (FJD) which prepends an affirmative instruction to the input and scales the logits by temperature to further distinguish between jailbreak and benign prompts through the confidence of the first token. Furthermore, we enhance the detection performance of FJD through the integration of virtual instruction learning. Extensive experiments on aligned LLMs show that our FJD can effectively detect jailbreak prompts with almost no additional computational costs during LLM inference.
中文摘要:大语言模型易受越狱攻击,而本文提出的免费越狱检测方法通过分析输出分布差异和首个令牌置信度,能以极低计算成本有效识别此类威胁。
English Summary: Large language models are vulnerable to jailbreak attacks, but the proposed Free Jailbreak Detection method effectively identifies these threats by analyzing output distribution differences and first-token confidence with minimal computational overhead.

Authors:Hanqi Zhu, Wuyang Zhang, Xinran Zhang, Ziyang Tao, Xinrui Lin, Yu Zhang, Jianmin Ji, Yanyong Zhang
Title: UrgenGo: Urgency-Aware Transparent GPU Kernel Launching for Autonomous Driving
Abstract:
The rapid advancements in autonomous driving have introduced increasingly complex, real-time GPU-bound tasks critical for reliable vehicle operation. However, the proprietary nature of these autonomous systems and closed-source GPU drivers hinder fine-grained control over GPU executions, often resulting in missed deadlines that compromise vehicle performance. To address this, we present UrgenGo, a non-intrusive, urgency-aware GPU scheduling system that operates without access to application source code. UrgenGo implicitly prioritizes GPU executions through transparent kernel launch manipulation, employing task-level stream binding, delayed kernel launching, and batched kernel launch synchronization. We conducted extensive real-world evaluations in collaboration with a self-driving startup, developing 11 GPU-bound task chains for a realistic autonomous navigation application and implementing our system on a self-driving bus. Our results show a significant 61% reduction in the overall deadline miss ratio, compared to the state-of-the-art GPU scheduler that requires source code modifications.
中文: UrgenGo是一种非侵入式、紧急感知的GPU调度系统,通过透明内核操作在不需访问源代码的情况下,将自动驾驶任务的截止期限错过率显著降低了61%。
English: UrgenGo is a non-intrusive, urgency-aware GPU scheduling system that enhances autonomous driving performance by reducing deadline misses by 61% through transparent kernel manipulation, without requiring source code access.

Authors:George Ciubotariu, Zhuyun Zhou, Zongwei Wu, Radu Timofte
Title: MIORe & VAR-MIORe: Benchmarks to Push the Boundaries of Restoration
Abstract:
We introduce MIORe and VAR-MIORe, two novel multi-task datasets that address critical limitations in current motion restoration benchmarks. Designed with high-frame-rate (1000 FPS) acquisition and professional-grade optics, our datasets capture a broad spectrum of motion scenarios, which include complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects. By adaptively averaging frames based on computed optical flow metrics, MIORe generates consistent motion blur, and preserves sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark to offer explicit control over motion amplitude. We provide high-resolution, scalable ground truths that challenge existing algorithms under both controlled and adverse conditions, paving the way for next-generation research of various image and video restoration tasks.
中文摘要:MIORe和VAR-MIORe是新型多任务数据集,通过高帧率采集技术解决现有运动恢复基准的不足,具备可控运动模糊和可变运动幅度特性,为图像和视频修复研究提供先进基准。
English Summary: MIORe and VAR-MIORe are novel multi-task datasets designed with high-frame-rate capture to address limitations in motion restoration benchmarks, featuring controlled motion blur and variable motion magnitudes for advanced image and video restoration research.

Authors:Chenming He, Rui Xia, Chengzhen Meng, Xiaoran Fan, Dequan Wang, Haojie Ren, Jianmin Ji, Yanyong Zhang
Title: Ghost Points Matter: Far-Range Vehicle Detection with a Single mmWave Radar in Tunnel
Abstract:
Vehicle detection in tunnels is crucial for traffic monitoring and accident response, yet remains underexplored. In this paper, we develop mmTunnel, a millimeter-wave radar system that achieves far-range vehicle detection in tunnels. The main challenge here is coping with ghost points caused by multi-path reflections, which lead to severe localization errors and false alarms. Instead of merely removing ghost points, we propose correcting them to true vehicle positions by recovering their signal reflection paths, thus reserving more data points and improving detection performance, even in occlusion scenarios. However, recovering complex 3D reflection paths from limited 2D radar points is highly challenging. To address this problem, we develop a multi-path ray tracing algorithm that leverages the ground plane constraint and identifies the most probable reflection path based on signal path loss and spatial distance. We also introduce a curve-to-plane segmentation method to simplify tunnel surface modeling such that we can significantly reduce the computational delay and achieve real-time processing. We have evaluated mmTunnel with comprehensive experiments. In two test tunnels, we conducted controlled experiments in various scenarios with cars and trucks. Our system achieves an average F1 score of 93.7% for vehicle detection while maintaining real-time processing. Even in the challenging occlusion scenarios, the F1 score remains above 91%. Moreover, we collected extensive data from a public tunnel with heavy traffic at times and show our method could achieve an F1 score of 91.5% in real-world traffic conditions.
Chinese: 本文提出毫米波雷达系统mmTunnel,通过多路径射线追踪算法将隧道中的多径反射虚警点校正为真实车辆位置,即使在遮挡场景下也能实现高精度检测和实时处理。
English: This paper introduces mmTunnel, a millimeter-wave radar system that overcomes ghost points from multi-path reflections in tunnels by correcting them to true vehicle positions using a multi-path ray tracing algorithm, achieving high detection accuracy and real-time processing even in occlusion scenarios.

Authors:Qiqi Zhan, Shiwei Li, Qingjie Liu, Yunhong Wang
Title: AttriPrompt: Dynamic Prompt Composition Learning for CLIP
Abstract:
The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: Over-reliance on constrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; Static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt-a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We designed an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37\% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.
中文:AttriPrompt是一种创新框架,通过利用CLIP视觉特征实现内容感知的提示自适应和双流对比学习的细粒度对齐,解决了深度文本提示的局限性,在视觉语言任务中取得了显著性能提升。
English: AttriPrompt is a novel framework that addresses limitations in deep text prompting by leveraging CLIP's visual features to enable content-aware prompt adaptation and fine-grained alignment through dual-stream contrastive learning, achieving significant performance improvements in vision-language tasks.

Authors:Huifeng Lin, Gang Su, Jintao Liang, You Wu, Rui Zhao, Ziyue Li
Title: Fishing for Answers: Exploring One-shot vs. Iterative Retrieval Strategies for Retrieval Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) based on Large Language Models (LLMs) is a powerful solution to understand and query the industry's closed-source documents. However, basic RAG often struggles with complex QA tasks in legal and regulatory domains, particularly when dealing with numerous government documents. The top-$k$ strategy frequently misses golden chunks, leading to incomplete or inaccurate answers. To address these retrieval bottlenecks, we explore two strategies to improve evidence coverage and answer quality. The first is a One-SHOT retrieval method that adaptively selects chunks based on a token budget, allowing as much relevant content as possible to be included within the model's context window. Additionally, we design modules to further filter and refine the chunks. The second is an iterative retrieval strategy built on a Reasoning Agentic RAG framework, where a reasoning LLM dynamically issues search queries, evaluates retrieved results, and progressively refines the context over multiple turns. We identify query drift and retrieval laziness issues and further design two modules to tackle them. Through extensive experiments on a dataset of government documents, we aim to offer practical insights and guidance for real-world applications in legal and regulatory domains.
Chinese Summary: 本研究针对基础检索增强生成在处理复杂法律监管查询时的局限性,提出了自适应单次检索和迭代式智能检索两种优化策略,通过改进证据覆盖和答案质量来解决实际应用中的瓶颈问题。
English Summary: This study addresses the limitations of basic Retrieval-Augmented Generation (RAG) in handling complex legal and regulatory queries by proposing two enhanced strategies—adaptive one-shot retrieval and iterative agentic retrieval—to improve evidence coverage and answer accuracy.

Authors:Weicao Deng, Sangwoo Park, Min Li, Osvaldo Simeone
Title: Optimizing In-Context Learning for Efficient Full Conformal Prediction
Abstract:
Reliable uncertainty quantification is critical for trustworthy AI. Conformal Prediction (CP) provides prediction sets with distribution-free coverage guarantees, but its two main variants face complementary limitations. Split CP (SCP) suffers from data inefficiency due to dataset partitioning, while full CP (FCP) improves data efficiency at the cost of prohibitive retraining complexity. Recent approaches based on meta-learning or in-context learning (ICL) partially mitigate these drawbacks. However, they rely on training procedures not specifically tailored to CP, which may yield large prediction sets. We introduce an efficient FCP framework, termed enhanced ICL-based FCP (E-ICL+FCP), which employs a permutation-invariant Transformer-based ICL model trained with a CP-aware loss. By simulating the multiple retrained models required by FCP without actual retraining, E-ICL+FCP preserves coverage while markedly reducing both inefficiency and computational overhead. Experiments on synthetic and real tasks demonstrate that E-ICL+FCP attains superior efficiency-coverage trade-offs compared to existing SCP and FCP baselines.
Chinese: 本文提出E-ICL+FCP增强型保形预测框架,通过采用基于Transformer的元学习模型和CP感知损失训练,无需实际重训练即可模拟多个重训练模型,在保持覆盖率的显著提升了效率并优于现有方法。
English: This paper introduces E-ICL+FCP, an enhanced conformal prediction framework that uses a Transformer-based model trained with a CP-aware loss to efficiently simulate multiple retrained models, achieving better efficiency-coverage trade-offs than existing methods.

Authors:Ziyu Chen, Junfei Sun, Chenxi Li, Tuan Dung Nguyen, Jing Yao, Xiaoyuan Yi, Xing Xie, Chenhao Tan, Lexing Xie
Title: MoVa: Towards Generalizable Classification of Human Morals and Values
Abstract:
Identifying human morals and values embedded in language is essential to empirical studies of communication. However, researchers often face substantial difficulty navigating the diversity of theoretical frameworks and data available for their analysis. Here, we contribute MoVa, a well-documented suite of resources for generalizable classification of human morals and values, consisting of (1) 16 labeled datasets and benchmarking results from four theoretically-grounded frameworks; (2) a lightweight LLM prompting strategy that outperforms fine-tuned models across multiple domains and frameworks; and (3) a new application that helps evaluate psychological surveys. In practice, we specifically recommend a classification strategy, all@once, that scores all related concepts simultaneously, resembling the well-known multi-label classifier chain. The data and methods in MoVa can facilitate many fine-grained interpretations of human and machine communication, with potential implications for the alignment of machine behavior.
中文摘要:MoVa提供了一套包含标注数据集、高效大语言模型提示方法和调查评估工具的资源套件,用于对人类道德价值观进行分类,有助于跨领域分析和机器行为对齐研究。
English Summary: MoVa provides a comprehensive toolkit for classifying human morals and values through labeled datasets, an efficient LLM prompting method, and a survey evaluation tool, enhancing cross-domain analysis and machine communication alignment.

Authors:Zikang Tian, Shaohui Peng, Du Huang, Jiaming Guo, Ruizhi Chen, Rui Zhang, Xishan Zhang, Yuxuan Guo, Zidong Du, Qi Guo, Ling Li, Yewen Pu, Xing Hu, Yunji Chen
Title: Code Driven Planning with Domain-Adaptive Critic
Abstract:
Large Language Models (LLMs) have been widely adopted as task planners for AI agents in sequential decision-making problems, leveraging their extensive world knowledge. However, the gap between their general knowledge and environment-specific requirements often leads to inaccurate plans. To address this, existing approaches rely on frequent LLM queries to iteratively refine plans based on immediate environmental feedback, which incurs substantial query costs. However, this refinement is typically guided by short-term environmental feedback, limiting LLMs from developing plans aligned with long-term rewards. We propose Code Driven Planning with Domain-Adaptive Critic (CoPiC). Instead of relying on frequent queries, CoPiC employs LLMs to generate a diverse set of high-level planning programs, which iteratively produce and refine candidate plans. A trained domain-adaptive critic then evaluates these candidates and selects the one most aligned with long-term rewards for execution. Using high-level planning programs as planner and domain-adaptive critic as estimator, CoPiC improves planning while significantly reducing query costs. Results in ALFWorld, NetHack, and StarCraft II Unit Building show that CoPiC outperforms advanced LLM-based baselines, AdaPlanner and Reflexion, achieving an average (1) 23.33% improvement in success rate and (2) 91.27% reduction in query costs.
中文摘要:CoPiC通过大语言模型生成多样化规划程序,并利用领域自适应评估器筛选符合长期奖励的方案,从而显著提升任务成功率并大幅降低查询成本。
English Summary: CoPiC enhances AI planning by generating diverse programs with LLMs and using a domain-adaptive critic to select long-term reward-aligned plans, significantly improving success rates and reducing query costs.

Authors:Nikolaos Spanos, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Athanasios Voulodimos, Giorgos Stamou
Title: V-CECE: Visual Counterfactual Explanations via Conceptual Edits
Abstract:
Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model, and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing both Convolutional Neural Network (CNN), Vision Transformer (ViT) and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.
中文摘要:本文提出了一种即插即用的黑盒反事实生成框架,利用预训练扩散模型通过理论保证的最优编辑步骤生成人类级别的解释,无需训练即可揭示人类推理与神经网络行为之间的解释性差距。
English Summary: This paper introduces a plug-and-play black-box counterfactual generation framework that uses a pre-trained diffusion model to produce human-level explanations through theoretically guaranteed optimal edits, requiring no training while exposing the explanatory gap between human reasoning and neural models.

Authors:Xiaoyu Zhang, Weipeng Jiang, Juan Zhai, Shiqing Ma, Qingshuang Bao, Chenhao Lin, Chao Shen, Tianlin Li, Yang Liu
Title: Rethinking Technology Stack Selection with AI Coding Proficiency
Abstract:
Large language models (LLMs) are now an integral part of software development workflows and are reshaping the whole process. Traditional technology stack selection has not caught up. Most of the existing selection methods focus solely on the inherent attributes of the technology, overlooking whether the LLM can effectively leverage the chosen technology. For example, when generating code snippets using popular libraries like Selenium (one of the most widely used test automation tools with over 33k GitHub stars), existing LLMs frequently generate low-quality code snippets (e.g., using deprecated APIs and methods, or containing syntax errors). As such, teams using LLM assistants risk choosing technologies that cannot be used effectively by LLMs, yielding high debugging effort and mounting technical debt. We foresee a practical question in the LLM era, is a technology ready for AI-assisted development? In this paper, we first propose the concept, AI coding proficiency, the degree to which LLMs can utilize a given technology to generate high-quality code snippets. We conduct the first comprehensive empirical study examining AI proficiency across 170 third-party libraries and 61 task scenarios, evaluating six widely used LLMs. Our findings reveal that libraries with similar functionalities can exhibit up to 84% differences in the quality score of LLM-generated code, while different models also exhibit quality gaps among their generation results using the same library. These gaps translate into real engineering costs and can steer developer choices toward a narrow set of libraries with high AI coding proficiency, threatening technological diversity in the ecosystem. We call on the community to integrate AI proficiency assessments into technology selection frameworks and develop mitigation strategies, preserving competitive balance in AI-driven development.
中文: 传统技术选型方法忽视了AI编程熟练度,即大语言模型利用技术生成高质量代码的能力,导致高调试成本和技术多样性下降,亟需将AI熟练度评估纳入选型框架。
English: Traditional technology selection methods overlook AI coding proficiency—how well LLMs utilize technologies to generate quality code—leading to high debugging costs and reduced diversity, necessitating AI proficiency assessments in selection frameworks.

Authors:Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, Xudong Xu, Jiangmiao Pang
Title: InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts
Abstract:
The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce \textbf{InternScenes}, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.
中文摘要:InternScenes是一个包含4万场景的大规模可模拟室内数据集,通过整合多种来源并保留包含小物件的真实布局,解决了现有数据集的局限性,为场景生成和导航等高级应用提供了支持。
English Summary: InternScenes is a large-scale, simulatable indoor scene dataset with 40,000 diverse scenes that overcomes limitations of existing datasets by integrating multiple sources and preserving realistic layouts with small items, enabling advanced applications like scene generation and navigation.

Authors:Eoin O'Doherty, Nicole Weinrauch, Andrew Talone, Uri Klempner, Xiaoyuan Yi, Xing Xie, Yi Zeng
Title: The Morality of Probability: How Implicit Moral Biases in LLMs May Shape the Future of Human-AI Symbiosis
Abstract:
Artificial intelligence (AI) is advancing at a pace that raises urgent questions about how to align machine decision-making with human moral values. This working paper investigates how leading AI systems prioritize moral outcomes and what this reveals about the prospects for human-AI symbiosis. We address two central questions: (1) What moral values do state-of-the-art large language models (LLMs) implicitly favour when confronted with dilemmas? (2) How do differences in model architecture, cultural origin, and explainability affect these moral preferences? To explore these questions, we conduct a quantitative experiment with six LLMs, ranking and scoring outcomes across 18 dilemmas representing five moral frameworks. Our findings uncover strikingly consistent value biases. Across all models, Care and Virtue values outcomes were rated most moral, while libertarian choices were consistently penalized. Reasoning-enabled models exhibited greater sensitivity to context and provided richer explanations, whereas non-reasoning models produced more uniform but opaque judgments. This research makes three contributions: (i) Empirically, it delivers a large-scale comparison of moral reasoning across culturally distinct LLMs; (ii) Theoretically, it links probabilistic model behaviour with underlying value encodings; (iii) Practically, it highlights the need for explainability and cultural awareness as critical design principles to guide AI toward a transparent, aligned, and symbiotic future.
中文摘要:本研究探讨了先进人工智能系统在道德决策中的价值取向,发现各模型普遍偏向关怀与美德伦理框架而贬抑自由意志主义选择,并强调可解释性与文化认知对实现人机共生的重要性。
English Summary: This study examines how leading AI systems prioritize moral values in decision-making, revealing consistent biases favoring Care and Virtue frameworks while penalizing libertarian choices, and emphasizing the need for explainability and cultural awareness in AI development.

Authors:Qiyuan Chen, Jiahe Chen, Hongsen Huang, Qian Shao, Jintai Chen, Renjie Hua, Hongxia Xu, Ruijia Wu, Ren Chuan, Jian Wu
Title: Beyond Keywords: Driving Generative Search Engine Optimization with Content-Centric Agents
Abstract:
The paradigm shift from traditional ranked-based search to Generative Search Engines has rendered conventional SEO metrics obsolete, creating an urgent need to understand, measure, and optimize for content influence on synthesized answers. This paper introduces a comprehensive, end-to-end framework for Generative Search Engine Optimization (GSEO) to address this challenge. We make two primary contributions. First, we construct CC-GSEO-Bench, a large-scale, content-centric benchmark, and propose a multi-dimensional evaluation framework that systematically quantifies influence, moving beyond surface-level attribution to assess substantive semantic impact. Second, we design a novel multi-agent system that operationalizes this framework, automating the strategic refinement of content through a collaborative analyze-revise-evaluate workflow. Our empirical analysis using this framework reveals novel insights into the dynamics of content influence, offering actionable strategies for creators and establishing a principled foundation for future GSEO research.
中文摘要:本文提出了一套生成式搜索引擎优化(GSEO)整体框架,通过构建内容中心化基准和多智能体系统,量化内容影响力并实现内容自动优化,以应对传统搜索指标失效的挑战。
English Summary: This paper introduces a comprehensive Generative Search Engine Optimization (GSEO) framework featuring a content-centric benchmark and multi-agent system to quantify content influence and automate content refinement for synthesized answers.

Authors:Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu
Title: Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
Abstract:
Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs' representation space for efficient and tailored preference dataset construction, named Icon$^{2}$. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.
中文:Icon²方法通过利用大语言模型的表示空间编码人类偏好并引导响应生成,构建高质量偏好数据集,在显著提升对齐效果和效率的同时大幅降低了计算成本。
English: The proposed Icon² method constructs high-quality preference datasets by leveraging LLMs' representation space to encode human preferences and steer response generation, achieving significant improvements in alignment and efficiency with reduced computational costs.

Authors:Bangxiang Lan, Ruobing Xie, Ruixiang Zhao, Xingwu Sun, Zhanhui Kang, Gang Yang, Xirong Li
Title: Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval
Abstract:
The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks: Two-Tower versus Single-Tower framework, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower framework, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, ie, PIG, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of pseudo-query to interact in a fine-grained manner, similar to the Single-Tower approaches to hold high effectiveness, even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of $1.6\% \sim 3.9\%$ in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.
中文: 本研究提出了一种用于文本到视频检索的混合塔式框架,通过名为PIG的新型伪查询交互方法,结合了双塔和单塔模型的优势,实现了高效性和高效能。
English: This study introduces a Hybrid-Tower framework for Text-to-Video Retrieval, combining the strengths of Two-Tower and Single-Tower models to achieve high effectiveness and efficiency through a novel pseudo-query interaction method called PIG.

Authors:Hengjie Cao, Mengyi Chen, Yifeng Yang, Ruijun Huang, Fang Dong, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, Fan Wu, Fan Yang, Tun Lu, Ning Gu, Li Shang
Title: Metis: Training LLMs with FP4 Quantization
Abstract:
This work identifies anisotropy in the singular value spectra of parameters, activations, and gradients as the fundamental barrier to low-bit training of large language models (LLMs). These spectra are dominated by a small fraction of large singular values, inducing wide numerical ranges that cause quantization bias and severe spectral distortion, ultimately degrading training performance. This work presents Metis, a spectral-domain quantization framework that partitions anisotropic spectra into narrower sub-distributions for independent quantization, thereby reducing errors and preserving spectral structure. To minimize overhead, Metis leverages two key properties of the dominant spectral subspace: preservation via sparsely random sampling and preservation via random projection, reducing decomposition cost to a negligible level. On LLaMA-3 8B trained with 100B tokens, Metis enables robust W4A4G4 training with FP4 quantization of weights, activations, and gradients, yielding only a 0.4% training loss gap and a 0.1% degradation in downstream accuracy relative to BF16. Beyond matching BF16 fidelity, Metis also surpasses our implementation of Nvidia's recently announced (yet to be publicly released) FP4 recipe, consistently achieving lower loss and higher downstream accuracy while incurring significantly lower computational overhead. The code implementation for Metis is available at: https://anonymous.4open.science/r/Metis-quantization-644B.
中文摘要:本研究识别出参数分布的各向异性是低比特量化训练的主要障碍,并提出Metis框架,通过谱分解、自适应学习率和双范围正则化技术,使FP4/FP8训练在保持稳定性的同时达到FP32基准性能。
English Summary: This study identifies anisotropic parameter distributions as a key obstacle in low-bit LLM quantization and introduces Metis, a training framework that employs spectral decomposition, adaptive learning rates, and dual-range regularization to enable stable FP4/FP8 training matching FP32 performance.

Authors:Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee
Title: YOLO26: Key Architectural Enhancements and Performance Benchmarking for Real-Time Object Detection
Abstract:
This study presents a comprehensive analysis of Ultralytics YOLO26, highlighting its key architectural enhancements and performance benchmarking for real-time object detection. YOLO26, released in September 2025, stands as the newest and most advanced member of the YOLO family, purpose-built to deliver efficiency, accuracy, and deployment readiness on edge and low-power devices. The paper sequentially details architectural innovations of YOLO26, including the removal of Distribution Focal Loss (DFL), adoption of end-to-end NMS-free inference, integration of ProgLoss and Small-Target-Aware Label Assignment (STAL), and the introduction of the MuSGD optimizer for stable convergence. Beyond architecture, the study positions YOLO26 as a multi-task framework, supporting object detection, instance segmentation, pose/keypoints estimation, oriented detection, and classification. We present performance benchmarks of YOLO26 on edge devices such as NVIDIA Jetson Nano and Orin, comparing its results with YOLOv8, YOLOv11, YOLOv12, YOLOv13, and transformer-based detectors(RF-DETR and RT-DETR). This paper further explores real-time deployment pathways, flexible export options (ONNX, TensorRT, CoreML, TFLite), and quantization for INT8/FP16. Practical use cases of YOLO26 across robotics, manufacturing, and IoT are highlighted to demonstrate cross-industry adaptability. Finally, insights on deployment efficiency and broader implications are discussed, with future directions for YOLO26 and the YOLO lineage outlined.
中文: 本研究介绍了YOLO26作为YOLO系列最新版本,通过架构创新实现边缘设备高效实时目标检测,通过与早期版本对比验证性能优势,并详述其跨行业部署应用场景。
English: This study introduces YOLO26 as the latest YOLO iteration with architectural innovations for efficient real-time object detection on edge devices, benchmarking its performance against predecessors and detailing deployment applications.

Authors:Xiangyu Zeng, Kefan Qiu, Qingyu Zhang, Xinhao Li, Jing Wang, Jiaxin Li, Ziang Yan, Kun Tian, Meng Tian, Xinhai Zhao, Yi Wang, Limin Wang
Title: StreamForest: Efficient Online Video Understanding with Persistent Event Memory
Abstract:
Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in video understanding. However, their effectiveness in real-time streaming scenarios remains limited due to storage constraints of historical visual features and insufficient real-time spatiotemporal reasoning. To address these challenges, we propose StreamForest, a novel architecture specifically designed for streaming video understanding. Central to StreamForest is the Persistent Event Memory Forest, a memory mechanism that adaptively organizes video frames into multiple event-level tree structures. This process is guided by penalty functions based on temporal distance, content similarity, and merge frequency, enabling efficient long-term memory retention under limited computational resources. To enhance real-time perception, we introduce a Fine-grained Spatiotemporal Window, which captures detailed short-term visual cues to improve current scene perception. Additionally, we present OnlineIT, an instruction-tuning dataset tailored for streaming video tasks. OnlineIT significantly boosts MLLM performance in both real-time perception and future prediction. To evaluate generalization in practical applications, we introduce ODV-Bench, a new benchmark focused on real-time streaming video understanding in autonomous driving scenarios. Experimental results demonstrate that StreamForest achieves the state-of-the-art performance, with accuracies of 77.3% on StreamingBench, 60.5% on OVBench, and 55.6% on OVO-Bench. In particular, even under extreme visual token compression (limited to 1024 tokens), the model retains 96.8% of its average accuracy in eight benchmarks relative to the default setting. These results underscore the robustness, efficiency, and generalizability of StreamForest for streaming video understanding.
中文摘要:StreamForest提出了一种新颖架构,通过持久事件记忆森林和细粒度时空窗口解决实时流视频理解中的存储与推理限制,在多个基准测试中实现最优性能,并在计算资源受限条件下保持高效率。
English Summary: StreamForest introduces a novel architecture with Persistent Event Memory Forest and Fine-grained Spatiotemporal Window to overcome limitations in real-time streaming video understanding, achieving state-of-the-art performance across multiple benchmarks while maintaining high efficiency under computational constraints.

Authors:Yuhan Zhu, Xiangyu Zeng, Chenting Wang, Xinhao Li, Yicheng Xu, Ziang Yan, Yi Wang, Limin Wang
Title: FreeRet: MLLMs as Training-Free Retrievers
Abstract:
Multimodal large language models (MLLMs) are emerging as versatile foundations for mixed-modality retrieval. Yet, they often require heavy post-hoc training to convert them into contrastive encoders for retrieval. This work asks: Can off-the-shelf MLLMs serve as powerful retrievers without additional training? We present FreeRet, a plug-and-play framework that turns any MLLM into a two-stage retriever. FreeRet first derives semantically grounded embeddings directly from the model for fast candidate search, and then exploits its reasoning ability for precise reranking. The framework contributes three advances: bypassing lexical alignment layers to obtain semantically faithful embeddings, conditioning representation generation with explicit priors, and mitigating framing effect in reranking via neutral choice framing. On the MMEB and MMEB-V2 benchmarks spanning 46 datasets, FreeRet substantially outperforms models trained on millions of pairs. Beyond benchmarks, FreeRet is model-agnostic and scales seamlessly across MLLM families and sizes, preserves their generative abilities, supports arbitrary modality combinations, and unifies retrieval, reranking, and generation into end-to-end RAG within a single model. Our findings demonstrate that pretrained MLLMs, when carefully harnessed, can serve as strong retrieval engines without training, closing a critical gap in their role as generalists.
中文: FreeRet 是一种即插即用框架,可将任何多模态大语言模型转化为两阶段检索器,无需额外训练即可实现高效候选搜索和精确重排序,并在单一模型中统一检索、重排序和生成功能。
English: FreeRet is a plug-and-play framework that transforms any multimodal large language model into a two-stage retriever, enabling efficient candidate search and precise reranking without additional training, while unifying retrieval, reranking, and generation within a single model.

Authors:Yang Bai, Haoran Cheng, Yang Zhou, Jun Zhou, Arun Thirunavukarasu, Yuhe Ke, Jie Yao, Kanae Fukutsu, Chrystie Wan Ning Quek, Ashley Hong, Laura Gutierrez, Zhen Ling Teo, Darren Shu Jeng Ting, Brian T. Soetikno, Christopher S. Nielsen, Tobias Elze, Zengxiang Li, Linh Le Dinh, Hiok Hong Chan, Victor Koh, Marcus Tan, Kelvin Z. Li, Leonard Yip, Ching Yu Cheng, Yih Chung Tham, Gavin Siew Wei Tan, Leopold Schmetterer, Marcus Ang, Rahat Hussain, Jod Mehta, Tin Aung, Lionel Tim-Ee Cheng, Tran Nguyen Tuan Anh, Chee Leong Cheng, Tien Yin Wong, Nan Liu, Iain Beehuat Tan, Soon Thye Lim, Eyal Klang, Tony Kiat Hon Lim, Rick Siow Mong Goh, Yong Liu, Daniel Shu Wei Ting
Title: EVLF-FM: Explainable Vision Language Foundation Model for Medicine
Abstract:
Despite the promise of foundation models in medical AI, current systems remain limited - they are modality-specific and lack transparent reasoning processes, hindering clinical adoption. To address this gap, we present EVLF-FM, a multimodal vision-language foundation model (VLM) designed to unify broad diagnostic capability with fine-grain explainability. The development and testing of EVLF-FM encompassed over 1.3 million total samples from 23 global datasets across eleven imaging modalities related to six clinical specialties: dermatology, hepatology, ophthalmology, pathology, pulmonology, and radiology. External validation employed 8,884 independent test samples from 10 additional datasets across five imaging modalities. Technically, EVLF-FM is developed to assist with multiple disease diagnosis and visual question answering with pixel-level visual grounding and reasoning capabilities. In internal validation for disease diagnostics, EVLF-FM achieved the highest average accuracy (0.858) and F1-score (0.797), outperforming leading generalist and specialist models. In medical visual grounding, EVLF-FM also achieved stellar performance across nine modalities with average mIOU of 0.743 and Acc@0.5 of 0.837. External validations further confirmed strong zero-shot and few-shot performance, with competitive F1-scores despite a smaller model size. Through a hybrid training strategy combining supervised and visual reinforcement fine-tuning, EVLF-FM not only achieves state-of-the-art accuracy but also exhibits step-by-step reasoning, aligning outputs with visual evidence. EVLF-FM is an early multi-disease VLM model with explainability and reasoning capabilities that could advance adoption of and trust in foundation models for real-world clinical deployment.
中文: EVLF-FM是一种多模态视觉语言基础模型,将广泛的诊断能力与精细可解释性相结合,在多种医学影像模态中实现了最先进的性能,并展现出强大的推理能力以促进临床应用。
English: EVLF-FM is a multimodal vision-language foundation model that unifies broad diagnostic capabilities with fine-grained explainability, achieving state-of-the-art performance across multiple medical imaging modalities and demonstrating strong reasoning abilities for clinical adoption.

Authors:Kuanrong Liu, Siyuan Liang, Cheng Qian, Ming Zhang, Xiaochun Cao
Title: Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives
Abstract:
As a general-purpose vision-language pretraining model, CLIP demonstrates strong generalization ability in image-text alignment tasks and has been widely adopted in downstream applications such as image classification and image-text retrieval. However, it struggles with fine-grained tasks such as object detection and semantic segmentation. While many variants aim to improve CLIP on these tasks, its robustness to adversarial perturbations remains underexplored. Understanding how adversarial examples transfer across tasks is key to assessing CLIP's generalization limits and security risks. In this work, we conduct a systematic empirical analysis of the cross-task transfer behavior of CLIP-based models on image-text retrieval, object detection, and semantic segmentation under adversarial perturbations. We find that adversarial examples generated from fine-grained tasks (e.g., object detection and semantic segmentation) often exhibit stronger transfer potential than those from coarse-grained tasks, enabling more effective attacks against the original CLIP model. Motivated by this observation, we propose a novel framework, Multi-Task Adversarial CLIP (MT-AdvCLIP), which introduces a task-aware feature aggregation loss and generates perturbations with enhanced cross-task generalization capability. This design strengthens the attack effectiveness of fine-grained task models on the shared CLIP backbone. Experimental results on multiple public datasets show that MT-AdvCLIP significantly improves the adversarial transfer success rate (The average attack success rate across multiple tasks is improved by over 39%.) against various CLIP-derived models, without increasing the perturbation budget. This study reveals the transfer mechanism of adversarial examples in multi-task CLIP models, offering new insights into multi-task robustness evaluation and adversarial example design.
Chinese: 本研究分析CLIP模型中的跨任务对抗样本迁移,发现细粒度任务生成的扰动具有更强迁移性,并提出MT-AdvCLIP框架,在不增加扰动预算的情况下将多任务攻击成功率平均提升39%以上。
English: This study analyzes cross-task adversarial transfer in CLIP models, revealing that fine-grained task perturbations transfer more effectively and proposing MT-AdvCLIP framework to enhance attack success rates by over 39% without extra perturbation costs.

Authors:Ziliang Hong, Halil Ertugrul Aktas, Andrea Mia Bejar, Katherine Wu, Hongyi Pan, Gorkem Durak, Zheyuan Zhang, Sait Kayali, Temel Tirkes, Federica Proietto Salanitri, Concetto Spampinato, Michael Goggins, Tamas Gonda, Candice Bolan, Raj Keswani, Frank Miller, Michael Wallace, Ulas Bagci
Title: Pancreas Part Segmentation under Federated Learning Paradigm
Abstract:
We present the first federated learning (FL) approach for pancreas part(head, body and tail) segmentation in MRI, addressing a critical clinical challenge as a significant innovation. Pancreatic diseases exhibit marked regional heterogeneity cancers predominantly occur in the head region while chronic pancreatitis causes tissue loss in the tail, making accurate segmentation of the organ into head, body, and tail regions essential for precise diagnosis and treatment planning. This segmentation task remains exceptionally challenging in MRI due to variable morphology, poor soft-tissue contrast, and anatomical variations across patients. Our novel contribution tackles two fundamental challenges: first, the technical complexity of pancreas part delineation in MRI, and second the data scarcity problem that has hindered prior approaches. We introduce a privacy-preserving FL framework that enables collaborative model training across seven medical institutions without direct data sharing, leveraging a diverse dataset of 711 T1W and 726 T2W MRI scans. Our key innovations include: (1) a systematic evaluation of three state-of-the-art segmentation architectures (U-Net, Attention U-Net,Swin UNETR) paired with two FL algorithms (FedAvg, FedProx), revealing Attention U-Net with FedAvg as optimal for pancreatic heterogeneity, which was never been done before; (2) a novel anatomically-informed loss function prioritizing region-specific texture contrasts in MRI. Comprehensive evaluation demonstrates that our approach achieves clinically viable performance despite training on distributed, heterogeneous datasets.
我们首次提出用于MRI胰腺部位分割的联邦学习框架,通过多机构协作不共享数据的方式解决数据稀缺和解剖复杂性难题,采用优化模型和专用损失函数实现了临床可用的性能。
We introduce the first federated learning framework for pancreas part segmentation in MRI, addressing data scarcity and anatomical complexity through multi-institutional collaboration without sharing data, achieving clinically viable results with optimized models and a specialized loss function.

Authors:Alec K. Peltekian, Karolina Senkow, Gorkem Durak, Kevin M. Grudzinski, Bradford C. Bemiss, Jane E. Dematte, Carrie Richardson, Nikolay S. Markov, Mary Carns, Kathleen Aren, Alexandra Soriano, Matthew Dapas, Harris Perlman, Aaron Gundersheimer, Kavitha C. Selvan, John Varga, Monique Hinchcliff, Krishnan Warrior, Catherine A. Gao, Richard G. Wunderink, GR Scott Budinger, Alok N. Choudhary, Anthony J. Esposito, Alexander V. Misharin, Ankit Agrawal, Ulas Bagci
Title: Imaging-Based Mortality Prediction in Patients with Systemic Sclerosis
Abstract:
Interstitial lung disease (ILD) is a leading cause of morbidity and mortality in systemic sclerosis (SSc). Chest computed tomography (CT) is the primary imaging modality for diagnosing and monitoring lung complications in SSc patients. However, its role in disease progression and mortality prediction has not yet been fully clarified. This study introduces a novel, large-scale longitudinal chest CT analysis framework that utilizes radiomics and deep learning to predict mortality associated with lung complications of SSc. We collected and analyzed 2,125 CT scans from SSc patients enrolled in the Northwestern Scleroderma Registry, conducting mortality analyses at one, three, and five years using advanced imaging analysis techniques. Death labels were assigned based on recorded deaths over the one-, three-, and five-year intervals, confirmed by expert physicians. In our dataset, 181, 326, and 428 of the 2,125 CT scans were from patients who died within one, three, and five years, respectively. Using ResNet-18, DenseNet-121, and Swin Transformer we use pre-trained models, and fine-tuned on 2,125 images of SSc patients. Models achieved an AUC of 0.769, 0.801, 0.709 for predicting mortality within one-, three-, and five-years, respectively. Our findings highlight the potential of both radiomics and deep learning computational methods to improve early detection and risk assessment of SSc-related interstitial lung disease, marking a significant advancement in the literature.
Chinese: 本研究开发了一种新型纵向胸部CT分析框架,结合影像组学和深度学习技术预测系统性硬化症间质性肺病患者的死亡率,在一、三和五年期预测中均表现出优异性能。
English: This study developed a novel longitudinal chest CT analysis framework using radiomics and deep learning to predict mortality in systemic sclerosis patients with interstitial lung disease, achieving strong predictive performance across one-, three-, and five-year intervals.

Authors:Tao Xiong, Xavier Hu, Yurun Chen, Yuhang Liu, Changqiao Wu, Pengzhi Gao, Wei Liu, Jian Luan, Shengyu Zhang
Title: GUI-PRA: Process Reward Agent for GUI Tasks
Abstract:
Graphical User Interface (GUI) Agents powered by Multimodal Large Language Models (MLLMs) show significant potential for automating tasks. However, they often struggle with long-horizon tasks, leading to frequent failures. Process Reward Models (PRMs) are a promising solution, as they can guide these agents with crucial process signals during inference. Nevertheless, their application to the GUI domain presents unique challenges. When processing dense artificial inputs with long history data, PRMs suffer from a "lost in the middle" phenomenon, where the overwhelming historical context compromises the evaluation of the current step. Furthermore, standard PRMs lacks GUI changing awareness, providing static evaluations that are disconnected from the dynamic consequences of actions, a critical mismatch with the inherently dynamic nature of GUI tasks. In response to these challenges, we introduce GUI-PRA (Process Reward Agent for GUI Tasks), a judge agent designed to better provide process reward than standard PRM by intelligently processing historical context and actively perceiving UI state changes. Specifically, to directly combat the ``lost in the middle'' phenomenon, we introduce a dynamic memory mechanism consisting of two core components: a Relevance-based Retrieval Module to actively fetch pertinent information from long histories and a Progressive Summarization Module to dynamically condense growing interaction data, ensuring the model focuses on relevant context. Moreover, to address the lack of UI changing awareness, we introduce an Aadaptive UI Perception mechanism. This mechanism enables the agent to reason about UI state changes and dynamically select the most appropriate tool to gather grounded visual evidence, ensuring its evaluation is always informed by the current UI context.
中文: GUI-PRA作为一种新型流程奖励智能体,通过动态记忆机制解决长历史数据中的“中间迷失”问题,并采用自适应界面感知技术实时捕捉界面状态变化,从而显著提升了图形界面任务中流程奖励评估的准确性。
English: GUI-PRA is a novel process reward agent that overcomes standard PRM limitations in GUI automation by dynamically managing long interaction histories through relevance-based retrieval and progressive summarization, while actively perceiving UI state changes for context-aware evaluations.

Authors:Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu, Qunli Zhang, Zheng Hu, Xiaochun Cao
Title: Explaining multimodal LLMs via intra-modal token interactions
Abstract:
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate these interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce \textit{Multi-Scale Explanation Aggregation} (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose \textit{Activation Ranking Correlation} (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-$k$ prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
Chinese: 本文针对多模态大语言模型可解释性方法的不足,提出了多尺度解释聚合技术增强视觉连贯性和激活排序相关性技术优化文本关联性,在多种模型和数据集上实现了更忠实、更细粒度的行为解释。
English: This paper addresses the limitations of current interpretability methods for Multimodal Large Language Models by proposing two novel techniques—Multi-Scale Explanation Aggregation for visual coherence and Activation Ranking Correlation for textual relevance—that significantly enhance explanation fidelity and granularity across various models and datasets.

Authors:Wenqiang Wang, Siyuan Liang, Xiao Yan, Xiaochun Cao
Title: Text Adversarial Attacks with Dynamic Outputs
Abstract:
Text adversarial attack methods are typically designed for static scenarios with fixed numbers of output labels and a predefined label space, relying on extensive querying of the victim model (query-based attacks) or the surrogate model (transfer-based attacks). To address this gap, we introduce the Textual Dynamic Outputs Attack (TDOA) method, which employs a clustering-based surrogate model training approach to convert the dynamic-output scenario into a static single-output scenario. To improve attack effectiveness, we propose the farthest-label targeted attack strategy, which selects adversarial vectors that deviate most from the model's coarse-grained labels, thereby maximizing disruption. We extensively evaluate TDOA on four datasets and eight victim models (e.g., ChatGPT-4o, ChatGPT-4.1), showing its effectiveness in crafting adversarial examples and its strong potential to compromise large language models with limited access. With a single query per text, TDOA achieves a maximum attack success rate of 50.81\%. Additionally, we find that TDOA also achieves state-of-the-art performance in conventional static output scenarios, reaching a maximum ASR of 82.68\%. Meanwhile, by conceptualizing translation tasks as classification problems with unbounded output spaces, we extend the TDOA framework to generative settings, surpassing prior results by up to 0.64 RDBLEU and 0.62 RDchrF.
Chinese: 文本动态输出攻击(TDOA)方法通过聚类代理模型训练将动态输出场景转化为静态设置,并采用最远标签策略,在ChatGPT-4o等模型上以最少查询实现了高攻击成功率。
English: The Textual Dynamic Outputs Attack (TDOA) method effectively targets dynamic-output scenarios by converting them into static setups and using a farthest-label strategy, achieving high attack success rates on models like ChatGPT-4o with minimal queries.

Authors:Ilias Diakonikolas, Jelena Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas
Title: Linear Regression under Missing or Corrupted Coordinates
Abstract:
We study multivariate linear regression under Gaussian covariates in two settings, where data may be erased or corrupted by an adversary under a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an $η$-fraction of samples per coordinate; a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even information-theoretically. Unlike the clean setting, where estimation error vanishes with more samples, here the optimal error remains a positive function of the problem parameters. Our main contribution is to characterize this error up to constant factors across essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error that match the error of (computationally efficient) algorithms. A key implication is that, perhaps surprisingly, the optimal error in the missing data setting matches that in the corruption setting-so knowing the corruption locations offers no general advantage.
中文: 本研究确定了在对抗性数据删除或损坏下多元线性回归的最优误差界限,揭示了了解损坏位置并无普遍优势,因为两种设置产生的误差率完全相同。
English: This research characterizes the optimal error bounds for multivariate linear regression under adversarial data deletion or corruption, revealing that knowing corruption locations provides no general advantage as both settings yield identical error rates.

Authors:Haoqin Sun, Chenyang Lyu, Xiangyu Kong, Shiwan Zhao, Jiaming Zhou, Hui Wang, Aobo Kong, Jinghua Zhao, Longyue Wang, Weihua Luo, Kaifu Zhang, Yong Qin
Title: MECap-R1: Emotion-aware Policy with Reinforcement Learning for Multimodal Emotion Captioning
Abstract:
Speech Emotion Captioning (SEC) has emerged as a notable research direction. The inherent complexity of emotional content in human speech makes it challenging for traditional discrete classification methods to provide an adequate representation. Consequently, utilizing natural language to describe speech emotions presents a novel avenue for more effectively capturing and expressing affect. In this paper, we propose MECap-R1, a pioneering emotion-aware policy with reinforcement learning for multimodal emotion captioning. By employing Group Relative Policy Optimization with emotion-aware reward (Emo-GRPO), the framework precisely captures the emotion and semantic features, thereby addressing the shortcomings of rigid rules in handling the dynamic and flexible nature of captions. Experimental results on the EmotionTalk dataset demonstrate that MECap-R1 performs well in generating emotion descriptions and achieves substantial gains in both accuracy and diversity.
中文: MECap-R1采用情感感知强化学习策略,通过Emo-GRPO方法精准捕捉情感和语义特征,在EmotionTalk数据集上验证了其在生成准确多样情感描述方面的卓越性能。
English: MECap-R1 introduces a novel emotion-aware reinforcement learning policy for multimodal emotion captioning, effectively capturing emotional and semantic features to generate accurate and diverse descriptions, as validated on the EmotionTalk dataset.

Authors:Yeongbin Seo, Dongha Lee, Jinyoung Yeo
Title: Quantifying Self-Awareness of Knowledge in Large Language Models
Abstract:
Hallucination prediction in large language models (LLMs) is often interpreted as a sign of self-awareness. However, we argue that such performance can arise from question-side shortcuts rather than true model-side introspection. To disentangle these factors, we propose the Approximate Question-side Effect (AQE), which quantifies the contribution of question-awareness. Our analysis across multiple datasets reveals that much of the reported success stems from exploiting superficial patterns in questions. We further introduce SCAO (Semantic Compression by Answering in One word), a method that enhances the use of model-side signals. Experiments show that SCAO achieves strong and consistent performance, particularly in settings with reduced question-side cues, highlighting its effectiveness in fostering genuine self-awareness in LLMs.
中文: 该研究质疑大语言模型中的幻觉预测反映自我意识的观点,指出其源于问题侧捷径,并提出AQE指标和SCAO方法以强化模型真正的内省能力。
English: The study challenges the notion that hallucination prediction in LLMs reflects self-awareness, attributing it instead to question-side shortcuts and introducing the AQE metric and SCAO method to enhance genuine model introspection.

Authors:Yue Xin, Wenyuan Wang, Rui Pan, Ruida Wang, Howard Meng, Renjie Pi, Shizhe Diao, Tong Zhang
Title: Generalizable Geometric Image Caption Synthesis
Abstract:
Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. Furthermore, even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models, yielding accuracy improvements of $2.8\%\text{-}4.8\%$ in statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images of MathVista and MathVerse, along with $2.4\%\text{-}3.9\%$ improvements in Art, Design, Tech, and Engineering tasks in MMMU.
中文摘要:本文提出了一种带可验证奖励的强化学习(RLVR)方法,通过生成高质量几何训练数据来增强多模态大语言模型的几何推理能力,同时提升其在非几何任务上的表现。
English Summary: This paper introduces a Reinforcement Learning with Verifiable Rewards (RLVR) method to enhance multimodal large language models' geometric reasoning by generating high-quality training data, which also improves their performance on non-geometric tasks.

Authors:Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo
Title: Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
Abstract:
Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions like semi-autoregressive address this issue by splitting windows into blocks, but this sacrifices speed and bidirectionality, eliminating the main advantage of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements.
中文: 自回归语言模型推理速度慢,扩散模型虽可并行解码但存在长解码窗口问题;我们提出的卷积解码和R2FT方法无需分割窗口即可提升生成质量,在保持双向优势的同时实现了最优性能与速度。
English: Autoregressive language models' slow inference is addressed by diffusion-based models, but they face the long decoding-window problem, which our proposed Convolutional decoding and R2FT methods overcome to achieve state-of-the-art generation quality and speed.

Authors:Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo
Title: Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
Abstract:
Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions like semi-autoregressive address this issue by splitting windows into blocks, but this sacrifices speed and bidirectionality, eliminating the main advantage of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements.
中文: 自回归语言模型推理速度慢,扩散模型虽可并行解码但存在长解码窗口问题;我们提出的卷积解码和R2FT方法无需分割窗口即可提升生成质量,在保持双向优势的同时实现了最优性能与速度。
English: Autoregressive language models' slow inference is addressed by diffusion-based models, but they face the long decoding-window problem, which our proposed Convolutional decoding and R2FT methods overcome to achieve state-of-the-art generation quality and speed.

Authors:Yuxiao Lee, Xiaofeng Cao, Wei Ye, Jiangchao Yao, Jingkuan Song, Heng Tao Shen
Title: An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity
Abstract:
Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot out-of-distribution (OOD) detection capabilities, vital for reliable AI systems. Despite this promising capability, a comprehensive understanding of (1) why they work so effectively, (2) what advantages do they have over single-modal methods, and (3) how is their behavioral robustness -- remains notably incomplete within the research community. This paper presents a systematic empirical analysis of VLM-based OOD detection using in-distribution (ID) and OOD prompts. (1) Mechanisms: We systematically characterize and formalize key operational properties within the VLM embedding space that facilitate zero-shot OOD detection. (2) Advantages: We empirically quantify the superiority of these models over established single-modal approaches, attributing this distinct advantage to the VLM's capacity to leverage rich semantic novelty. (3) Sensitivity: We uncovers a significant and previously under-explored asymmetry in their robustness profile: while exhibiting resilience to common image noise, these VLM-based methods are highly sensitive to prompt phrasing. Our findings contribute a more structured understanding of the strengths and critical vulnerabilities inherent in VLM-based OOD detection, offering crucial, empirically-grounded guidance for developing more robust and reliable future designs.
中文: 视觉语言模型通过利用语义新颖性在零样本分布外检测中表现卓越,但尽管对图像噪声具有鲁棒性,却显示出对提示措辞的关键敏感性。
English: Vision-Language Models excel in zero-shot out-of-distribution detection by leveraging semantic novelty but reveal critical sensitivity to prompt phrasing despite robustness against image noise.

Authors:Zhefei Gong, Shangke Lyu, Pengxiang Ding, Wei Xiao, Donglin Wang
Title: Robust Online Residual Refinement via Koopman-Guided Dynamics Modeling
Abstract:
Imitation learning (IL) enables efficient skill acquisition from demonstrations but often struggles with long-horizon tasks and high-precision control due to compounding errors. Residual policy learning offers a promising, model-agnostic solution by refining a base policy through closed-loop corrections. However, existing approaches primarily focus on local corrections to the base policy, lacking a global understanding of state evolution, which limits robustness and generalization to unseen scenarios. To address this, we propose incorporating global dynamics modeling to guide residual policy updates. Specifically, we leverage Koopman operator theory to impose linear time-invariant structure in a learned latent space, enabling reliable state transitions and improved extrapolation for long-horizon prediction and unseen environments. We introduce KORR (Koopman-guided Online Residual Refinement), a simple yet effective framework that conditions residual corrections on Koopman-predicted latent states, enabling globally informed and stable action refinement. We evaluate KORR on long-horizon, fine-grained robotic furniture assembly tasks under various perturbations. Results demonstrate consistent gains in performance, robustness, and generalization over strong baselines. Our findings further highlight the potential of Koopman-based modeling to bridge modern learning methods with classical control theory.
中文摘要:本文提出KORR框架,通过结合基于Koopman算子的全局动力学建模来指导残差策略修正,显著提升了模仿学习在复杂机器人任务(如家具组装)中的性能与鲁棒性。
English Summary: This paper introduces KORR, a framework that enhances imitation learning for complex robotic tasks by integrating Koopman operator-based global dynamics modeling to guide residual policy corrections, demonstrating improved performance and robustness in furniture assembly scenarios.

Authors:Jiacheng Liu, Pengxiang Ding, Qihang Zhou, Yuxuan Wu, Da Huang, Zimian Peng, Wei Xiao, Weinan Zhang, Lixin Yang, Cewu Lu, Donglin Wang
Title: TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning
Abstract:
Recent Vision-Language-Action models show potential to generalize across embodiments but struggle to quickly align with a new robot's action space when high-quality demonstrations are scarce, especially for bipedal humanoids. We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. TrajBooster (i) extracts 6D dual-arm end-effector trajectories from real-world wheeled humanoids, (ii) retargets them in simulation to Unitree G1 with a whole-body controller trained via a heuristic-enhanced harmonized online DAgger to lift low-dimensional trajectory references into feasible high-dimensional whole-body actions, and (iii) forms heterogeneous triplets that couple source vision/language with target humanoid-compatible actions to post-pre-train a VLA, followed by only 10 minutes of teleoperation data collection on the target humanoid domain. Deployed on Unitree G1, our policy achieves beyond-tabletop household tasks, enabling squatting, cross-height manipulation, and coordinated whole-body motion with markedly improved robustness and generalization. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance, reducing reliance on costly same-embodiment data while enhancing action space understanding and zero-shot skill transfer capabilities. For more details, For more details, please refer to our \href{https://jiachengliu3.github.io/TrajBooster/}.
中文:TrajBooster是一种跨具身框架,通过重用轮式人形机器人的轨迹数据来增强双足机器人的性能,仅需少量目标领域演示即可实现稳健的全身运动。
English: TrajBooster is a cross-embodiment framework that enhances bipedal robot performance by repurposing wheeled-humanoid trajectory data, achieving robust whole-body motions with minimal target-domain demonstrations.

Authors:Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith
Title: Fluid Language Model Benchmarking
Abstract:
Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation. Although various strategies have been proposed to mitigate these issues, they tend to address individual aspects in isolation, neglecting broader questions about overall evaluation quality. Here, we introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level, suggesting that evaluation should adapt to each LM. Methodologically, Fluid Benchmarking estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education. In our experiments, we compare Fluid Benchmarking against the common practice of random item sampling as well as more sophisticated baselines, including alternative methods grounded in item response theory. We examine four dimensions -- efficiency, validity, variance, and saturation -- and find that Fluid Benchmarking achieves superior performance in all of them (e.g., higher validity and less variance on MMLU with fifty times fewer items). Our analysis shows that the two components of Fluid Benchmarking have distinct effects: item response theory, used to map performance into a latent ability space, increases validity, while dynamic item selection reduces variance. Overall, our results suggest that LM benchmarking can be substantially improved by moving beyond static evaluation.
中文: 流体基准测试是一种自适应语言模型评估方法,它结合项目反应理论和动态题目选择,显著提升了评估效率、效度并降低方差,优于传统静态基准。
English: Fluid Benchmarking is an adaptive evaluation method for language models that uses item response theory and dynamic item selection to enhance efficiency, validity, and reduce variance, outperforming traditional static benchmarks.

Authors:Jinzheng Zhao, Yong Xu, Haohe Liu, Davide Berghi, Xinyuan Qian, Qiuqiang Kong, Junqi Zhao, Mark D. Plumbley, Wenwu Wang
Title: Region-Specific Audio Tagging for Spatial Sound
Abstract:
Audio tagging aims to label sound events appearing in an audio recording. In this paper, we propose region-specific audio tagging, a new task which labels sound events in a given region for spatial audio recorded by a microphone array. The region can be specified as an angular space or a distance from the microphone. We first study the performance of different combinations of spectral, spatial, and position features. Then we extend state-of-the-art audio tagging systems such as pre-trained audio neural networks (PANNs) and audio spectrogram transformer (AST) to the proposed region-specific audio tagging task. Experimental results on both the simulated and the real datasets show the feasibility of the proposed task and the effectiveness of the proposed method. Further experiments show that incorporating the directional features is beneficial for omnidirectional tagging.
中文: 本文提出区域特定音频标注这一新任务,通过扩展先进音频模型并融合方向特征,实现了对麦克风阵列录音中指定空间区域内声音事件的精准标注,实验验证了该方法的可行性与有效性。
English: This paper introduces region-specific audio tagging, a novel task that labels sound events within defined spatial areas using microphone array recordings, and demonstrates its feasibility and effectiveness through extended state-of-the-art models and directional feature integration.

Authors:Ruoxuan Li, Xiaoyao Zhong, Jiabao Jin, Peng Cheng, Wangze Ni, Lei Chen, Zhitao Shen, Wei Jia, Xiangyu Wang, Xuemin Lin, Heng Tao Shen, Jingkuan Song
Title: SINDI: an Efficient Index for Approximate Maximum Inner Product Search on Sparse Vectors
Abstract:
Sparse vector Maximum Inner Product Search (MIPS) is crucial in multi-path retrieval for Retrieval-Augmented Generation (RAG). Recent inverted index-based and graph-based algorithms have achieved high search accuracy with practical efficiency. However, their performance in production environments is often limited by redundant distance computations and frequent random memory accesses. Furthermore, the compressed storage format of sparse vectors hinders the use of SIMD acceleration. In this paper, we propose the sparse inverted non-redundant distance index (SINDI), which incorporates three key optimizations: (i) Efficient Inner Product Computation: SINDI leverages SIMD acceleration and eliminates redundant identifier lookups, enabling batched inner product computation; (ii) Memory-Friendly Design: SINDI replaces random memory accesses to original vectors with sequential accesses to inverted lists, substantially reducing memory-bound latency. (iii) Vector Pruning: SINDI retains only the high-magnitude non-zero entries of vectors, improving query throughput while maintaining accuracy. We evaluate SINDI on multiple real-world datasets. Experimental results show that SINDI achieves state-of-the-art performance across datasets of varying scales, languages, and models. On the MsMarco dataset, when Recall@50 exceeds 99%, SINDI delivers single-thread query-per-second (QPS) improvements ranging from 4.2 to 26.4 times compared with SEISMIC and PyANNs. Notably, SINDI has been integrated into Ant Group's open-source vector search library, VSAG.
中文:本文提出SINDI优化索引,通过SIMD加速、顺序内存访问和向量剪枝三大创新,在保证精度的同时大幅提升稀疏向量检索效率,已在蚂蚁集团开源向量库VSAG中投入使用。
English: This paper introduces SINDI, an optimized sparse vector index that enhances retrieval efficiency in RAG systems by leveraging SIMD acceleration, sequential memory access, and vector pruning to achieve significant performance gains over existing methods.

Authors:Yuxing Liu, Yuze Ge, Rui Pan, An Kang, Tong Zhang
Title: Theoretical Analysis on how Learning Rate Warmup Accelerates Convergence
Abstract:
Learning rate warmup is a popular and practical technique in training large-scale deep neural networks. Despite the huge success in practice, the theoretical advantages of this strategy of gradually increasing the learning rate at the beginning of the training process have not been fully understood. To resolve this gap between theory and practice, we first propose a novel family of generalized smoothness assumptions, and validate its applicability both theoretically and empirically. Under the novel smoothness assumption, we study the convergence properties of gradient descent (GD) in both deterministic and stochastic settings. It is shown that learning rate warmup consistently accelerates GD, and GD with warmup can converge at most $Θ(T)$ times faster than with a non-increasing learning rate schedule in some specific cases, providing insights into the benefits of this strategy from an optimization theory perspective.
中文: 学习率预热在广义光滑性假设下加速梯度下降收敛,在特定情况下可比非递增学习率快至\(Θ(T)\)倍。
English: Learning rate warmup accelerates gradient descent convergence under a novel generalized smoothness assumption, achieving up to \(Θ(T)\) times faster convergence than non-increasing schedules in certain cases.

Authors:Yunfei Zhong, Jun Yang, Yixing Fan, Lixin Su, Maarten de Rijke, Ruqing Zhang, Xueqi Cheng
Title: Reasoning-enhanced Query Understanding through Decomposition and Interpretation
Abstract:
Accurate inference of user intent is crucial for enhancing document retrieval in modern search engines. While large language models (LLMs) have made significant strides in this area, their effectiveness has predominantly been assessed with short, keyword-based queries. As AI-driven search evolves, long-form queries with intricate intents are becoming more prevalent, yet they remain underexplored in the context of LLM-based query understanding (QU). To bridge this gap, we introduce ReDI: a Reasoning-enhanced approach for query understanding through Decomposition and Interpretation. ReDI leverages the reasoning and comprehension capabilities of LLMs in a three-stage pipeline: (i) it breaks down complex queries into targeted sub-queries to accurately capture user intent; (ii) it enriches each sub-query with detailed semantic interpretations to improve the query-document matching; and (iii) it independently retrieves documents for each sub-query and employs a fusion strategy to aggregate the results for the final ranking. We compiled a large-scale dataset of real-world complex queries from a major search engine and distilled the query understanding capabilities of teacher models into smaller models for practical application. Experiments on BRIGHT and BEIR demonstrate that ReDI consistently surpasses strong baselines in both sparse and dense retrieval paradigms, affirming its effectiveness. We release our code at https://anonymous.4open.science/r/ReDI-6FC7/.
Chinese: ReDI提出了一种推理增强方法,利用大语言模型将复杂查询分解为子查询并丰富语义解释,通过融合检索结果在稀疏和稠密检索范式中均展现出优于现有方法的性能。
English: ReDI introduces a reasoning-enhanced approach using large language models to decompose complex queries into sub-queries, enrich them with interpretations, and fuse retrieval results, demonstrating superior performance over existing methods in both sparse and dense retrieval paradigms.

Authors:Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
Title: BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
Abstract:
Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to \textbf{16\%} over DanceGRPO, while reducing per-iteration training time by nearly \textbf{55\%}. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and temporally consistent frames compared to DanceGRPO. Codes are available at \href{https://fredreic1849.github.io/BranchGRPO-Webpage/}{BranchGRPO}.
中文: BranchGRPO通过将生成模型的展开过程重构为带共享前缀和剪枝的分支树结构,将训练效率提升高达55%,对齐分数提高16%,优于现有方法。
English: BranchGRPO enhances generative model alignment by restructuring rollouts into a branching tree with shared prefixes and pruning, improving efficiency by up to 55% and alignment scores by 16% over prior methods.

Authors:Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
Title: BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
Abstract:
Recent progress in aligning image and video generative models with Group Relative Policy Optimization (GRPO) has improved human preference alignment, but existing variants remain inefficient due to sequential rollouts and large numbers of sampling steps, unreliable credit assignment: sparse terminal rewards are uniformly propagated across timesteps, failing to capture the varying criticality of decisions during denoising. In this paper, we present BranchGRPO, a method that restructures the rollout process into a branching tree, where shared prefixes amortize computation and pruning removes low-value paths and redundant depths. BranchGRPO introduces three contributions: (1) a branching scheme that amortizes rollout cost through shared prefixes while preserving exploration diversity; (2) a reward fusion and depth-wise advantage estimator that transforms sparse terminal rewards into dense step-level signals; and (3) pruning strategies that cut gradient computation but leave forward rollouts and exploration unaffected. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to \textbf{16\%} over DanceGRPO, while reducing per-iteration training time by nearly \textbf{55\%}. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than DanceGRPO without degrading alignment. On WanX video generation, it further achieves higher Video-Align scores with sharper and temporally consistent frames compared to DanceGRPO. Codes are available at \href{https://fredreic1849.github.io/BranchGRPO-Webpage/}{BranchGRPO}.
中文: BranchGRPO通过将生成模型的展开过程重构为带共享前缀和剪枝的分支树结构,将训练效率提升高达55%,对齐分数提高16%,优于现有方法。
English: BranchGRPO enhances generative model alignment by restructuring rollouts into a branching tree with shared prefixes and pruning, improving efficiency by up to 55% and alignment scores by 16% over prior methods.

Authors:Yi Yuan, Xubo Liu, Haohe Liu, Xiyuan Kang, Zhuo Chen, Yuxuan Wang, Mark D. Plumbley, Wenwu Wang
Title: DreamAudio: Customized Text-to-Audio Generation with Diffusion Models
Abstract:
With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short on precisely controlling fine-grained acoustic characteristics of specific sounds. As a result, users that need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework that is designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. The experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and aligned well with the input text prompts. Furthermore, DreamAudio offers comparable performance in general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as the benchmark for customized generation tasks.
中文:DreamAudio提出了一种定制化文本到音频生成框架,通过从用户提供的参考样本中学习听觉信息,能够精确控制细粒度声学特征,生成与个性化事件高度一致且与文本提示良好匹配的音频。
English: DreamAudio introduces a customized text-to-audio generation framework that enables precise control over fine-grained acoustic characteristics by learning from user-provided reference samples, producing audio highly consistent with personalized events while maintaining strong alignment with text prompts.

Authors:Ying Li, Xiaobao Wei, Xiaowei Chi, Yuming Li, Zhongyu Zhao, Hao Wang, Ningning Ma, Ming Lu, Shanghang Zhang
Title: ManipDreamer3D : Synthesizing Plausible Robotic Manipulation Video with Occupancy-aware 3D Trajectory
Abstract:
Data scarcity continues to be a major challenge in the field of robotic manipulation. Although diffusion models provide a promising solution for generating robotic manipulation videos, existing methods largely depend on 2D trajectories, which inherently face issues with 3D spatial ambiguity. In this work, we present a novel framework named ManipDreamer3D for generating plausible 3D-aware robotic manipulation videos from the input image and the text instruction. Our method combines 3D trajectory planning with a reconstructed 3D occupancy map created from a third-person perspective, along with a novel trajectory-to-video diffusion model. Specifically, ManipDreamer3D first reconstructs the 3D occupancy representation from the input image and then computes an optimized 3D end-effector trajectory, minimizing path length while avoiding collisions. Next, we employ a latent editing technique to create video sequences from the initial image latent and the optimized 3D trajectory. This process conditions our specially trained trajectory-to-video diffusion model to produce robotic pick-and-place videos. Our method generates robotic videos with autonomously planned plausible 3D trajectories, significantly reducing human intervention requirements. Experimental results demonstrate superior visual quality compared to existing methods.
中文摘要:ManipDreamer3D通过结合3D轨迹规划与新型扩散模型,生成具有自主规划3D轨迹的机器人操作视频,有效解决数据稀缺问题并显著减少人工干预需求。
English Summary: ManipDreamer3D addresses data scarcity in robotic manipulation by generating 3D-aware videos through 3D trajectory planning and a novel diffusion model, producing superior visual results with minimal human intervention.

Authors:Hui Wang, Cheng Liu, Junyang Chen, Haoze Liu, Yuhang Jia, Shiwan Zhao, Jiaming Zhou, Haoqin Sun, Hui Bu, Yong Qin
Title: TTA-Bench: A Comprehensive Benchmark for Evaluating Text-to-Audio Models
Abstract:
Text-to-Audio (TTA) generation has made rapid progress, but current evaluation methods remain narrow, focusing mainly on perceptual quality while overlooking robustness, generalization, and ethical concerns. We present TTA-Bench, a comprehensive benchmark for evaluating TTA models across functional performance, reliability, and social responsibility. It covers seven dimensions including accuracy, robustness, fairness, and toxicity, and includes 2,999 diverse prompts generated through automated and manual methods. We introduce a unified evaluation protocol that combines objective metrics with over 118,000 human annotations from both experts and general users. Ten state-of-the-art models are benchmarked under this framework, offering detailed insights into their strengths and limitations. TTA-Bench establishes a new standard for holistic and responsible evaluation of TTA systems. The dataset and evaluation tools are open-sourced at https://nku-hlt.github.io/tta-bench/.
中文: TTA-Bench提出了一套全面的文本到音频评估基准,通过涵盖准确性、鲁棒性、公平性等七个维度,结合2999个多样化提示和超11.8万条人工标注,对十种先进模型进行功能性能、可靠性及社会责任的多维度测评,建立了整体性评估新标准。
English: TTA-Bench is introduced as a comprehensive benchmark addressing the narrow scope of current Text-to-Audio evaluation by assessing seven dimensions across functional performance, reliability, and social responsibility, using 2,999 prompts and over 118,000 human annotations to evaluate ten state-of-the-art models holistically.

Authors:Yiyang Wang, Xi Chen, Xiaogang Xu, Yu Liu, Hengshuang Zhao
Title: DiffCamera: Arbitrary Refocusing on Images
Abstract:
The depth-of-field (DoF) effect, which introduces aesthetically pleasing blur, enhances photographic quality but is fixed and difficult to modify once the image has been created. This becomes problematic when the applied blur is undesirable~(e.g., the subject is out of focus). To address this, we propose DiffCamera, a model that enables flexible refocusing of a created image conditioned on an arbitrary new focus point and a blur level. Specifically, we design a diffusion transformer framework for refocusing learning. However, the training requires pairs of data with different focus planes and bokeh levels in the same scene, which are hard to acquire. To overcome this limitation, we develop a simulation-based pipeline to generate large-scale image pairs with varying focus planes and bokeh levels. With the simulated data, we find that training with only a vanilla diffusion objective often leads to incorrect DoF behaviors due to the complexity of the task. This requires a stronger constraint during training. Inspired by the photographic principle that photos of different focus planes can be linearly blended into a multi-focus image, we propose a stacking constraint during training to enforce precise DoF manipulation. This constraint enhances model training by imposing physically grounded refocusing behavior that the focusing results should be faithfully aligned with the scene structure and the camera conditions so that they can be combined into the correct multi-focus image. We also construct a benchmark to evaluate the effectiveness of our refocusing model. Extensive experiments demonstrate that DiffCamera supports stable refocusing across a wide range of scenes, providing unprecedented control over DoF adjustments for photography and generative AI applications.
中文: DiffCamera提出了一种扩散变换器模型,通过模拟不同焦点和虚化水平实现灵活的图像重对焦,并引入堆叠约束以精确控制景深,经广泛实验验证其有效性。
English: DiffCamera introduces a diffusion transformer model that enables flexible image refocusing by simulating varied focus and blur levels, enhanced with a stacking constraint for precise depth-of-field manipulation, validated through extensive experiments.

Authors:Harshith Gowrachari, Mattia Giuseppe Barra, Giovanni Stabile, Gianluca Bazzaro, Gianluigi Rozza
Title: Reservoir computing based predictive reduced order model for steel grade intermixing in an industrial continuous casting tundish
Abstract:
Continuous casting is a widely adopted process in the steel industry, where maintaining high steel quality is paramount. Efficient prediction of grade intermixing during ladle changeover operations is critical for maintaining steel quality and minimizing material losses in the continuous casting process. Among various factors influencing grade intermixing, operating parameters play a significant role, in addition to tundish geometry and flow control devices. In this study, three-dimensional, transient, two-phase turbulent flow simulations are conducted to investigate the ladle changeover operation. During this process, the molten steel level in the tundish typically varies over time, significantly affecting the grade intermixing phenomena. The influence of ladle change time on intermixing time has been presented. However, high-fidelity full-order simulations of such complex transient phenomena are computationally expensive and are impractical for real-time monitoring or design-space exploration in industrial-scale applications. To address this issue, a reduced order modelling approach based on proper orthogonal decomposition (POD) and reservoir computing (RC) is employed to efficiently predict intermixing time. The proposed reduced order model (ROM) demonstrates excellent predictive accuracy using limited training data while requiring significantly less computational resources and training time. The results demonstrate the potential of the proposed methodology as a fast, reliable tool for real-time process monitoring and optimization in industrial continuous casting operations.
中文摘要:本研究采用基于本征正交分解和储层计算的降阶模型,有效预测钢包更换过程中的钢种混浇现象,为连铸工业实时监控提供了快速可靠的解决方案。
English Summary: This study develops a reduced order model combining proper orthogonal decomposition and reservoir computing to efficiently predict steel grade intermixing during ladle changeovers, offering a fast and accurate solution for real-time monitoring in continuous casting operations.

Authors:Xiaoyan Zhao, Ming Yan, Yang Zhang, Yang Deng, Jian Wang, Fengbin Zhu, Yilun Qiu, Hong Cheng, Tat-Seng Chua
Title: Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts
Abstract:
Conversational Recommender Systems (CRSs) aim to provide personalized recommendations through multi-turn natural language interactions with users. Given the strong interaction and reasoning skills of Large Language Models (LLMs), leveraging LLMs for CRSs has recently emerged as a promising direction. However, existing LLM-based methods often lack explicit optimization of interaction strategies, instead relying on unified prompts and the LLM's internal knowledge to decide how to interact, which can lead to suboptimal outcomes. In this paper, we propose a novel Reinforced Strategy Optimization (RSO) method for CRS, which decomposes the process of generating strategy-driven response decisions into the macro-level strategy planning and micro-level strategy adaptation through a network-of-experts architecture. At the macro level, a Planner expert selects macro-level interaction strategies (e.g., recommend, explain, encourage). At the micro level, an Actor expert generates detailed responses conditioned on the selected macro-level strategy, guided by auxiliary experts that provide complementary information such as user preferences and factual grounding. This hierarchical decomposition disentangles the optimization of different sub-tasks involved in CRS response generation, enabling more tractable learning at each level. To address the scarcity of high-quality multi-turn training data, we formulate strategy learning as a reinforcement learning problem, guided by an LLM-based reward model to achieve automatic strategy exploration. Extensive experiments show that RSO significantly improves interaction performance compared to state-of-the-art baselines, demonstrating the effectiveness of explicit hierarchical strategy optimization for CRS.
中文: 本文提出了一种用于对话推荐系统的强化策略优化方法,通过专家网络架构将响应生成分解为宏观策略规划和微观策略适应,并利用基于大语言模型的奖励模型进行强化学习,显著提升了交互性能。
English: This paper introduces a Reinforced Strategy Optimization (RSO) method for conversational recommender systems, which hierarchically decomposes response generation into macro-level strategy planning and micro-level adaptation using a network-of-experts architecture, significantly improving interaction performance through reinforcement learning guided by an LLM-based reward model.

Authors:Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, Jing Xiong, Kashif Rasul, Mac Schwager, Anderson Schneider, Zhangyang Wang, Yuriy Nevmyvaka
Title: Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel
Abstract:
Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a designed choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and MoE can be interpreted as a special case of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \textbf{zero-additional-cost} Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. \textbf{Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in $\mathrm{KERN}$ router function.} Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function \methodNorm.
中文摘要:本研究通过提出KERN这一新型前馈神经网络风格的路由器,挑战了混合专家模型中传统使用Softmax的做法,该路由器能泛化现有方法,并通过实验验证了其有效性。
English Summary: This study challenges the conventional use of Softmax in Mixture-of-Experts models by proposing KERN, a novel FFN-style router that generalizes existing approaches and demonstrates improved performance through empirical validation.

Authors:Xintong Li, Chuhan Wang, Junda Wu, Rohan Surana, Tong Yu, Julian McAuley, Jingbo Shang
Title: Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization
Abstract:
Direct Preference Optimization (DPO) has recently been extended from text-only models to vision-language models. However, existing methods rely on oversimplified pairwise comparisons, generating a single negative image via basic perturbations or similarity-based retrieval, which fail to capture the complex nature of multimodal preferences, inducing optimization bias and hallucinations. To address this issue, we propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO via the Plackett-Luce model. Our method embeds prompts and candidate images in CLIP (Contrastive Language-Image Pretraining) space and applies a sparse autoencoder to uncover semantic deviations into interpretable factors. Negative samples are selected based on reconstruction difficulty, semantic deviation from the positive, and mutual diversity, yielding broader and more informative supervision. To handle multi-negative comparisons, we adopt a Plackett-Luce objective and introduce an importance sampling strategy that improves training efficiency. Experiments across five diverse benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods, validating the effectiveness of semantic-aware, multi-negative sampling in preference-based learning.
中文: MISP-DPO提出了一种创新框架,通过普拉凯特-卢斯模型引入多个语义多样的负样本图像来增强多模态直接偏好优化,在多个基准测试中显著提升了多模态对齐效果。
English: MISP-DPO introduces a novel framework that enhances multimodal Direct Preference Optimization by employing multiple semantically diverse negative images through the Plackett-Luce model, significantly improving multimodal alignment across various benchmarks.

Authors:Huihao Jing, Wenbin Hu, Hongyu Luo, Jianhui Yang, Wei Fan, Haoran Li, Yangqiu Song
Title: MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning
Abstract:
Multi-agent systems (MAS), leveraging the remarkable capabilities of Large Language Models (LLMs), show great potential in addressing complex tasks. In this context, integrating MAS with legal tasks is a crucial step. While previous studies have developed legal benchmarks for LLM agents, none are specifically designed to consider the unique advantages of MAS, such as task decomposition, agent specialization, and flexible training. In fact, the lack of evaluation methods limits the potential of MAS in the legal domain. To address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS and designed with a deductive reasoning approach. Our benchmark uses GDPR as the application scenario, encompassing extensive background knowledge and covering complex reasoning processes that effectively reflect the intricacies of real-world legal situations. Furthermore, we manually design various role-based MAS and conduct extensive experiments using different state-of-the-art LLMs. Our results highlight the strengths, limitations, and potential areas for improvement of existing models and MAS architectures.
中文: 本文提出MASLegalBench这一专为多智能体系统设计的法律基准,通过GDPR场景评估其在任务分解与推理中的能力,揭示了现有模型的优势与改进空间。
English: This paper introduces MASLegalBench, a legal benchmark designed for multi-agent systems (MAS) using GDPR scenarios to evaluate their capabilities in task decomposition and reasoning, revealing strengths and areas for improvement in current models.

Authors:Yiting Dong, Jianhao Ding, Zijie Xu, Tong Bu, Zhaofei Yu, Tiejun Huang
Title: PredNext: Explicit Cross-View Temporal Prediction for Unsupervised Learning in Spiking Neural Networks
Abstract:
Spiking Neural Networks (SNNs), with their temporal processing capabilities and biologically plausible dynamics, offer a natural platform for unsupervised representation learning. However, current unsupervised SNNs predominantly employ shallow architectures or localized plasticity rules, limiting their ability to model long-range temporal dependencies and maintain temporal feature consistency. This results in semantically unstable representations, thereby impeding the development of deep unsupervised SNNs for large-scale temporal video data. We propose PredNext, which explicitly models temporal relationships through cross-view future Step Prediction and Clip Prediction. This plug-and-play module seamlessly integrates with diverse self-supervised objectives. We firstly establish standard benchmarks for SNN self-supervised learning on UCF101, HMDB51, and MiniKinetics, which are substantially larger than conventional DVS datasets. PredNext delivers significant performance improvements across different tasks and self-supervised methods. PredNext achieves performance comparable to ImageNet-pretrained supervised weights through unsupervised training solely on UCF101. Additional experiments demonstrate that PredNext, distinct from forced consistency constraints, substantially improves temporal feature consistency while enhancing network generalization capabilities. This work provides a effective foundation for unsupervised deep SNNs on large-scale temporal video data.
Chinese: PredNext为脉冲神经网络提出了一种即插即用模块,通过跨视角未来预测任务建模时序关系,在大型视频数据上显著提升了无监督学习性能,在达到与监督方法相当效果的同时,有效增强了时序特征一致性和网络泛化能力。
English: PredNext introduces a plug-and-play module for Spiking Neural Networks that enhances unsupervised learning on large-scale video data by modeling temporal relationships through future prediction tasks, achieving performance comparable to supervised methods while improving temporal consistency and generalization.

Authors:Guibin Zhang, Fanci Meng, Guancheng Wan, Zherui Li, Kun Wang, Zhenfei Yin, Lei Bai, Shuicheng Yan
Title: LatentEvolve: Self-Evolving Test-Time Scaling in Latent Space
Abstract:
Test-time Scaling (TTS) has been demonstrated to significantly enhance the reasoning capabilities of Large Language Models (LLMs) during the inference phase without altering model parameters. However, existing TTS methods are largely independent, implying that LLMs have not yet evolved to progressively learn how to scale more effectively. With the objective of evolving LLMs to learn ``how to scale test-time computation,'' we propose LatentEvolve, a self-evolving latent TTS framework inspired by the complementary learning system (CLS) theory. Analogous to the human brain's dual system of a fast-recall hippocampus and a slow-consolidating neocortex, LatentEvolve comprises two evolutionary components: \textit{daytime scaling}, which rapidly retrieves historical latent representations to better guide current LLM reasoning; and \textit{nighttime scaling}, which integrates past latent optimizations in a manner akin to the human brain's consolidation of experiences during sleep. The alternation of daytime and nighttime processes facilitates a fast and slow evolution of LLM TTS, mirroring human cognitive dynamics in a fully unsupervised manner. Extensive experiments across eight benchmarks and five model backbones demonstrate that our LatentEvolve surpasses state-of-the-art TTS methods such as LatentSeek and TTRL by up to $13.33\%$ and exhibits exceptional cross-domain and cross-backbone generalization.
中文摘要:LatentEvolve通过模拟人类记忆系统的昼夜交替机制,实现了无需参数更新的测试时计算自适应进化,在多个基准测试中显著超越了现有最优方法并展现出卓越的泛化能力。
English Summary: LatentEvolve is a self-evolving framework that enhances LLM reasoning through alternating daytime and nighttime scaling processes, demonstrating superior performance and generalization over existing methods without parameter updates.

Authors:Rahul Halder, Giovanni Stabile, Gianluigi Rozza
Title: Coupling Physics Informed Neural Networks with External Solvers
Abstract:
The current work aims to incorporate physics-based loss in Physics Informed Neural Network (PINN) directly using the numerical residual obtained from the governing equation in any dicretized forward solver. PINN's major difficulties in coupling with external forward solvers arise from the inability to access the discretized form (Finite difference, finite volume, finite element, etc.) of the governing equation directly through the network and to include them in its computational graph. This poses a significant challenge to conventional automatic-differentiation-based derivative computation of physics-based loss terms concerning the neural network hyperparameters if gradient-based optimization techniques are adopted. Therefore, we propose modifying the physics-based loss term to account for the residual arising from the external solver and to compute the derivative required for the optimization machinery. The proposed methodologies are demonstrated on benchmark full-order and reduced-order systems.
中文: 本研究改进物理信息神经网络中的物理损失项,通过引入外部求解器的数值残差,克服了传统方法难以直接利用离散化控制方程进行梯度优化的关键难题。
English: This study modifies the physics-based loss term in Physics Informed Neural Networks (PINN) to incorporate numerical residuals from external solvers, enabling effective gradient-based optimization despite the challenges of accessing discretized governing equations.

Authors:Liang Lin, Zhihao Xu, Junhao Dong, Jian Zhao, Yuchen Yuan, Guibin Zhang, Miao Yu, Yiming Zhang, Zhengtao Yao, Huahui Yi, Dongrui Liu, Xinfeng Li, Kun Wang
Title: OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment
Abstract:
Large language model (LLM) alignment faces a critical dilemma when addressing multiple human preferences: improvements in one dimension frequently come at the expense of others, creating unavoidable trade-offs between competing objectives like helpfulness and harmlessness. While prior work mainly focuses on constraint-based optimization algorithms and data selection strategies to mitigate conflicts, these approaches overlook the fundamental issue of resolving conflicts directly at the parameter level. In this paper, we present OrthAlign, an innovative approach that pioneers a new paradigm by leveraging orthogonal subspace decomposition to fundamentally resolve gradient-level conflicts in multi-objective preference alignment. OrthAlign strategically decomposes parameter update spaces into orthogonal subspaces, ensuring that optimization toward different preferences occurs in mathematically non-interfering directions. Building upon this, we provide theoretical guarantees demonstrating that when parameter increments satisfy both orthogonal subspace constraints and spectral norm bounds, the resulting updates exhibit linear Lipschitz growth rather than exponential instability, ensuring stable convergence across all preference dimensions. Extensive experiments show that: I. OrthAlign achieves maximum single-preference improvements ranging from 34.61% to 50.89% after multiple-objective alignment across helpful, harmless, and truthful dimensions. II. With an average overall reward improvement of 13.96%.
中文摘要:OrthAlign通过正交子空间分解创新性地解决了多目标大语言模型对齐中的梯度冲突问题,在保持稳定收敛的同时实现了多个偏好维度的显著性能提升。
English Summary: OrthAlign introduces a novel method using orthogonal subspace decomposition to resolve gradient conflicts in multi-objective LLM alignment, achieving significant improvements across multiple preference dimensions while ensuring stable convergence.

Authors:Haoran Li, Yulin Chen, Jingru Zeng, Hao Peng, Huihao Jing, Wenbin Hu, Xi Yang, Ziqian Zeng, Sirui Han, Yangqiu Song
Title: GSPR: Aligning LLM Safeguards as Generalizable Safety Policy Reasoners
Abstract:
As large language models (LLMs) are increasingly integrated into numerous applications across various domains, LLMs' safety becomes a critical concern for both application developers and intended users. Currently, great efforts have been made to develop safety benchmarks with fine-grained taxonomies. However, these benchmarks' taxonomies are disparate with different safety policies. Thus, existing safeguards trained on these benchmarks are either coarse-grained to only distinguish between safe and unsafe, or constrained by the narrow risk taxonomies of a single benchmark. To leverage these fine-grained safety taxonomies across multiple safety benchmarks, in this paper, we propose GSPR, a Generalizable Safety Policy Reasoner to identify unsafe input prompts and LLMs' outputs with violated safety taxonomies through Group Relative Policy Optimization (GRPO). Unlike prior safeguards which only cover a fixed set of risk factors, our GSPR incentivizes its reasoning capability with varied safety taxonomies through our careful cold-start strategy and reward design. Consequently, our GSPR can be trained across multiple safety benchmarks with distinct taxonomies and naturally exhibits powerful generalization ability. We conduct extensive experiments to show that our GSPR significantly improves existing safety guardrails' reasoning capabilities for both safety and category prediction tasks. Moreover, our GSPR not only demonstrates powerful safety generalization abilities but also achieves the least inference token costs with explanations.
中文: 随着大语言模型(LLMs)在各领域广泛应用,其安全性日益重要,为此我们提出GSPR,一种通用安全策略推理器,通过群体相对策略优化和精心设计的奖励机制,提升跨多个安全基准的推理能力和泛化性能。
English: With the growing use of large language models (LLMs) across applications, ensuring their safety is crucial, leading to the development of GSPR, a Generalizable Safety Policy Reasoner that enhances safety reasoning and generalization across diverse benchmarks through innovative optimization and reward design.

Authors:Guoliang Zhao, Yuhan Fu, Shuaipeng Li, Xingwu Sun, Ruobing Xie, An Wang, Weidong Han, Zhen Yang, Weixuan Sun, Yudong Zhang, Cheng-zhong Xu, Di Wang, Jie Jiang
Title: Towards a Comprehensive Scaling Law of Mixture-of-Experts
Abstract:
Mixture-of-Experts (MoE) models have become the consensus approach for enabling parameter-efficient scaling and cost-effective deployment in large language models. However, existing scaling laws for dense models are inapplicable to MoE models, which stems from three critical challenges: the multiplicity of influencing factors, their intricate coupling relationships and the non-monotonic nature of their performance impacts. They collectively necessitate a fine-grained investigation into MoE-specific scaling laws. In this work, we perform a systematic decomposition of MoE settings, identifying five key factors that influence model performance from both size and structural perspectives (data size ($D$), total model size ($N$), activated model size ($N_a$), number of active experts ($G$) and the ratio of shared experts ($S$)). Specifically, we design $446$ controlled experiments to characterize their marginal effects, ultimately constructing a comprehensive and precise joint MoE scaling law that considers all essential factors. Furthermore, we derive the theoretically optimal and practically efficiency-aware optimal configurations for $G$, $S$ and $N_a/N$ with detailed analyses. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size. With the scaling of $N$, the optimal activation parameter ratio of $N_a/N$ becomes sparser. Our proposed MoE scaling law could function as an accurate and insightful guidance to facilitate future MoE model design and training.
中文: 混合专家模型因影响因素复杂需专门扩展法则,本研究建立的联合扩展定律揭示了最优配置与模型架构及数据规模无关,并指导参数高效激活。
English: Mixture-of-Experts models require specialized scaling laws due to complex factor interactions, leading to a comprehensive joint scaling law that reveals optimal configurations independent of model architecture and data size.

Authors:Maojiang Su, Mingcheng Lu, Jerry Yao-Chieh Hu, Shang Wu, Zhao Song, Alex Reneau, Han Liu
Title: A Theoretical Analysis of Discrete Flow Matching Generative Models
Abstract:
We provide a theoretical analysis for end-to-end training Discrete Flow Matching (DFM) generative models. DFM is a promising discrete generative modeling framework that learns the underlying generative dynamics by training a neural network to approximate the transformative velocity field. Our analysis establishes a clear chain of guarantees by decomposing the final distribution estimation error. We first prove that the total variation distance between the generated and target distributions is controlled by the risk of the learned velocity field. We then bound this risk by analyzing its two primary sources: (i) Approximation Error, where we quantify the capacity of the Transformer architecture to represent the true velocity, and (ii) Estimation Error, where we derive statistical convergence rates that bound the error from training on a finite dataset. By composing these results, we provide the first formal proof that the distribution generated by a trained DFM model provably converges to the true data distribution as the training set size increases.
中文: 该理论分析通过将估计误差分解为近似误差和统计误差,证明了训练后的离散流匹配模型生成的分布会随着训练集规模的增大而收敛到真实数据分布。
English: This theoretical analysis demonstrates that the distribution generated by a trained Discrete Flow Matching model converges to the true data distribution as the training set size increases, by decomposing the estimation error into approximation and statistical components.

Authors:Wenbin Hu, Huihao Jing, Haochen Shi, Haoran Li, Yangqiu Song
Title: Safety Compliance: Rethinking LLM Safety Reasoning through the Lens of Compliance
Abstract:
The proliferation of Large Language Models (LLMs) has demonstrated remarkable capabilities, elevating the critical importance of LLM safety. However, existing safety methods rely on ad-hoc taxonomy and lack a rigorous, systematic protection, failing to ensure safety for the nuanced and complex behaviors of modern LLM systems. To address this problem, we solve LLM safety from legal compliance perspectives, named safety compliance. In this work, we posit relevant established legal frameworks as safety standards for defining and measuring safety compliance, including the EU AI Act and GDPR, which serve as core legal frameworks for AI safety and data security in Europe. To bridge the gap between LLM safety and legal compliance, we first develop a new benchmark for safety compliance by generating realistic LLM safety scenarios seeded with legal statutes. Subsequently, we align Qwen3-8B using Group Policy Optimization (GRPO) to construct a safety reasoner, Compliance Reasoner, which effectively aligns LLMs with legal standards to mitigate safety risks. Our comprehensive experiments demonstrate that the Compliance Reasoner achieves superior performance on the new benchmark, with average improvements of +10.45% for the EU AI Act and +11.85% for GDPR.
中文: 本研究通过引入法律合规框架解决现有大语言模型安全方法的不足,基于欧盟AI法案和GDPR创建新基准,并开发出合规推理器,使模型与法律标准对齐后安全性能提升超过10%。
English: This study addresses the limitations of current Large Language Model safety methods by introducing a legal compliance framework, developing a benchmark based on EU AI Act and GDPR standards, and creating a Compliance Reasoner that improves safety performance by over 10% through alignment with legal requirements.

Authors:Ziqing Wang, Yibo Wen, William Pattie, Xiao Luo, Weimin Wu, Jerry Yao-Chieh Hu, Abhishek Pandey, Han Liu, Kaize Ding
Title: POLO: Preference-Guided Multi-Turn Reinforcement Learning for Lead Optimization
Abstract:
Lead optimization in drug discovery requires efficiently navigating vast chemical space through iterative cycles to enhance molecular properties while preserving structural similarity to the original lead compound. Despite recent advances, traditional optimization methods struggle with sample efficiency-achieving good optimization performance with limited oracle evaluations. Large Language Models (LLMs) provide a promising approach through their in-context learning and instruction following capabilities, which align naturally with these iterative processes. However, existing LLM-based methods fail to leverage this strength, treating each optimization step independently. To address this, we present POLO (Preference-guided multi-turn Optimization for Lead Optimization), which enables LLMs to learn from complete optimization trajectories rather than isolated steps. At its core, POLO introduces Preference-Guided Policy Optimization (PGPO), a novel reinforcement learning algorithm that extracts learning signals at two complementary levels: trajectory-level optimization reinforces successful strategies, while turn-level preference learning provides dense comparative feedback by ranking intermediate molecules within each trajectory. Through this dual-level learning from intermediate evaluation, POLO achieves superior sample efficiency by fully exploiting each costly oracle call. Extensive experiments demonstrate that POLO achieves 84% average success rate on single-property tasks (2.3x better than baselines) and 50% on multi-property tasks using only 500 oracle evaluations, significantly advancing the state-of-the-art in sample-efficient molecular optimization.
中文摘要:POLO提出了一种新颖的强化学习方法,通过从完整优化轨迹和比较反馈中学习,在有限评估次数下实现了分子优化的显著样本效率提升。
English Summary: POLO introduces a novel reinforcement learning approach that learns from complete optimization trajectories and comparative feedback, achieving significantly higher sample efficiency in molecular optimization with limited oracle evaluations.

Authors:Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim
Title: Learning GUI Grounding with Spatial Reasoning from Visual Feedback
Abstract:
Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing high-resolution GUI images with complex layouts. To address this issue, we reframe GUI grounding as an \emph{interactive search task}, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves the GUI grounding accuracy and achieves state-of-the-art results on ScreenSpot-v2 ($88.8\% \rightarrow 93.9\%$) and ScreenSpot-Pro ($26.8\% \rightarrow 56.5\%$). Moreover, we observe that GUI-Cursor learns to solve the problem within two steps for 95\% of instances and can adaptively conduct more steps on more difficult examples.
中文: 本研究将图形用户界面定位重构为交互式搜索任务,通过视觉语言模型逐步移动光标定位界面元素,利用多步强化学习实现了最先进的定位精度。
English: This research reframes GUI grounding as an interactive search task where a Vision Language Model moves a cursor step-by-step to locate UI elements, achieving state-of-the-art accuracy through multi-step reinforcement learning.

Authors:Chengkai Huang, Junda Wu, Zhouhang Xie, Yu Xia, Rui Wang, Tong Yu, Subrata Mitra, Julian McAuley, Lina Yao
Title: Pluralistic Off-policy Evaluation and Alignment
Abstract:
Personalized preference alignment for LLMs with diverse human preferences requires evaluation and alignment methods that capture pluralism. Most existing preference alignment datasets are logged under policies that differ substantially from the evaluated LLMs, and existing off-policy estimators focus solely on overall utility while ignoring preference pluralism. Extending Off-Policy Evaluation (OPE) to pluralistic preference alignment, therefore, remains an open question. Thus, we propose the Pluralistic Off-Policy Evaluation (POPE), the first framework for offline pluralistic preference evaluation and alignment in LLMs. POPE includes a unified reward function that combines (1) a collaborative utility component derived from human preference signals (e.g., upvotes or relevance scores) and (2) a diversity component inspired by entropy-based coverage measures, together reflecting pluralistic alignment. Furthermore, to estimate this reward from logged interactions, we derive decomposable inverse propensity scoring (IPS) estimators that separately evaluate relevance and diversity. Theoretically, we prove that our decomposed IPS estimators establish a lower bound on their variance. With the off-policy evaluated value function, we can directly enable off-policy optimization to further enhance pluralistic alignment. Empirical results demonstrate that POPE efficiently enhances pluralistic response generation and maintains the models' general capabilities on downstream tasks
中文摘要:提出的多元离策略评估(POPE)框架通过统一奖励函数和分解评估器,能够分别评估相关性与多样性,从而实现对具有多元人类偏好的大语言模型进行离线评估和对齐。
English Summary: The proposed Pluralistic Off-Policy Evaluation (POPE) framework enables offline evaluation and alignment of LLMs with diverse human preferences through a unified reward function and decomposed estimators that separately assess relevance and diversity.

Authors:Tom Pawelek, Raj Patel, Charlotte Crowell, Noorbakhsh Amiri, Sudip Mittal, Shahram Rahimi, Andy Perkins
Title: LLMZ+: Contextual Prompt Whitelist Principles for Agentic LLMs
Abstract:
Compared to traditional models, agentic AI represents a highly valuable target for potential attackers as they possess privileged access to data sources and API tools, which are traditionally not incorporated into classical agents. Unlike a typical software application residing in a Demilitarized Zone (DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI (only defining a final goal, leaving the path selection to LLM). This characteristic introduces substantial security risk to both operational security and information security. Most common existing defense mechanism rely on detection of malicious intent and preventing it from reaching the LLM agent, thus protecting against jailbreak attacks such as prompt injection. In this paper, we present an alternative approach, LLMZ+, which moves beyond traditional detection-based approaches by implementing prompt whitelisting. Through this method, only contextually appropriate and safe messages are permitted to interact with the agentic LLM. By leveraging the specificity of context, LLMZ+ guarantees that all exchanges between external users and the LLM conform to predefined use cases and operational boundaries. Our approach streamlines the security framework, enhances its long-term resilience, and reduces the resources required for sustaining LLM information security. Our empirical evaluation demonstrates that LLMZ+ provides strong resilience against the most common jailbreak prompts. At the same time, legitimate business communications are not disrupted, and authorized traffic flows seamlessly between users and the agentic LLM. We measure the effectiveness of approach using false positive and false negative rates, both of which can be reduced to 0 in our experimental setting.
中文摘要:智能体AI因其数据特权和非确定性行为带来重大安全风险,而LLMZ+系统通过情境感知的白名单机制有效防御越狱攻击,在保障合法通信的同时将误报和漏报率降至零。
English Summary: Agentic AI poses significant security risks due to its privileged data access and nondeterministic behavior, but the proposed LLMZ+ system enhances protection through context-aware prompt whitelisting that prevents jailbreak attacks while maintaining legitimate operations.

Authors:Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song, Guanghui Qin, Naoto Usuyama, Cliff Wong, Cheng Hao, Hohin Lee, Praneeth Sanapathi, Sarah Hilado, Bian Jiang, Javier Alvarez-Valle, Mu Wei, Jianfeng Gao, Eric Horvitz, Matt Lungren, Hoifung Poon, Paul Vozila
Title: The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks
Abstract:
Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren't glitches; they expose how today's benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.
中文总结:尽管大型AI模型在医学基准测试中得分领先,但压力测试显示其存在脆弱性、走捷径学习及不可靠推理的问题,无法真实反映临床实践中的可靠性。
English Summary: Despite achieving top scores on medical benchmarks, large AI models exhibit brittleness, shortcut learning, and unreliable reasoning under stress tests, failing to reflect real-world clinical readiness.

Authors:Yu Gu, Jingjing Fu, Xiaodong Liu, Jeya Maria Jose Valanarasu, Noel CF Codella, Reuben Tan, Qianchu Liu, Ying Jin, Sheng Zhang, Jinyu Wang, Rui Wang, Lei Song, Guanghui Qin, Naoto Usuyama, Cliff Wong, Hao Cheng, Hohin Lee, Praneeth Sanapathi, Sarah Hilado, Jiang Bian, Javier Alvarez-Valle, Mu Wei, Khalil Malik, Jianfeng Gao, Eric Horvitz, Matthew P Lungren, Hoifung Poon, Paul Vozila
Title: The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks
Abstract:
Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren't glitches; they expose how today's benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.
中文总结:尽管大型AI模型在医学基准测试中得分领先,但压力测试显示其存在脆弱性、走捷径学习及不可靠推理的问题,无法真实反映临床实践中的可靠性。
English Summary: Despite achieving top scores on medical benchmarks, large AI models exhibit brittleness, shortcut learning, and unreliable reasoning under stress tests, failing to reflect real-world clinical readiness.

Authors:Yichen Wang, Hangtao Zhang, Hewen Pan, Ziqi Zhou, Xianlong Wang, Peijin Guo, Lulu Xue, Shengshan Hu, Minghui Li, Leo Yu Zhang
Title: ADVEDM:Fine-grained Adversarial Attack against VLM-based Embodied Agents
Abstract:
Vision-Language Models (VLMs), with their strong reasoning and planning capabilities, are widely used in embodied decision-making (EDM) tasks in embodied agents, such as autonomous driving and robotic manipulation. Recent research has increasingly explored adversarial attacks on VLMs to reveal their vulnerabilities. However, these attacks either rely on overly strong assumptions, requiring full knowledge of the victim VLM, which is impractical for attacking VLM-based agents, or exhibit limited effectiveness. The latter stems from disrupting most semantic information in the image, which leads to a misalignment between the perception and the task context defined by system prompts. This inconsistency interrupts the VLM's reasoning process, resulting in invalid outputs that fail to affect interactions in the physical world. To this end, we propose a fine-grained adversarial attack framework, ADVEDM, which modifies the VLM's perception of only a few key objects while preserving the semantics of the remaining regions. This attack effectively reduces conflicts with the task context, making VLMs output valid but incorrect decisions and affecting the actions of agents, thus posing a more substantial safety threat in the physical world. We design two variants of based on this framework, ADVEDM-R and ADVEDM-A, which respectively remove the semantics of a specific object from the image and add the semantics of a new object into the image. The experimental results in both general scenarios and EDM tasks demonstrate fine-grained control and excellent attack performance.
Chinese: ADVEDM框架提出了一种细粒度对抗攻击方法,通过微妙修改图像中关键对象来操控视觉语言模型的输出,使其产生有效但错误的决策,从而对具身智能体在物理环境中的安全性构成实质性威胁。
English: The ADVEDM framework introduces a fine-grained adversarial attack that subtly alters key objects in images to manipulate Vision-Language Models' outputs, producing valid but incorrect decisions that threaten embodied agents' safety in physical environments.

Authors:Cheng Jiayang, Qianqian Zhuang, Haoran Li, Chunkit Chan, Xin Liu, Lin Qiu, Yangqiu Song
Title: InteGround: On the Evaluation of Verification and Retrieval Planning in Integrative Grounding
Abstract:
Grounding large language models (LLMs) in external knowledge sources is a promising method for faithful prediction. While existing grounding approaches work well for simple queries, many real-world information needs require synthesizing multiple pieces of evidence. We introduce "integrative grounding" -- the challenge of retrieving and verifying multiple inter-dependent pieces of evidence to support a hypothesis query. To systematically study this problem, we repurpose data from four domains for evaluating integrative grounding capabilities. Our investigation reveals two critical findings: First, in groundedness verification, while LLMs are robust to redundant evidence, they tend to rationalize using internal knowledge when information is incomplete. Second, in examining retrieval planning strategies, we find that undirected planning can degrade performance through noise introduction, while premise abduction emerges as a promising approach due to its logical constraints. Additionally, LLMs' zero-shot self-reflection capabilities consistently improve grounding quality. These insights provide valuable direction for developing more effective integrative grounding systems.
中文摘要:综合式基础研究旨在解决复杂查询中多证据综合的难题,研究发现大语言模型在冗余信息下保持稳健,但面对信息缺失时会依赖内部知识进行合理化,而前提溯因与自我反思能力能有效提升基础质量。
English Summary: Integrative grounding addresses the challenge of synthesizing multiple interdependent evidence pieces for complex queries, revealing that LLMs maintain robustness with redundant information but tend to rationalize gaps with internal knowledge, while premise abduction and self-reflection significantly enhance grounding effectiveness.

Authors:Tan-Hiep To, Duy-Khang Nguyen, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Title: GenKOL: Modular Generative AI Framework For Scalable Virtual KOL Generation
Abstract:
Key Opinion Leader (KOL) play a crucial role in modern marketing by shaping consumer perceptions and enhancing brand credibility. However, collaborating with human KOLs often involves high costs and logistical challenges. To address this, we present GenKOL, an interactive system that empowers marketing professionals to efficiently generate high-quality virtual KOL images using generative AI. GenKOL enables users to dynamically compose promotional visuals through an intuitive interface that integrates multiple AI capabilities, including garment generation, makeup transfer, background synthesis, and hair editing. These capabilities are implemented as modular, interchangeable services that can be deployed flexibly on local machines or in the cloud. This modular architecture ensures adaptability across diverse use cases and computational environments. Our system can significantly streamline the production of branded content, lowering costs and accelerating marketing workflows through scalable virtual KOL creation.
中文: GenKOL是一个利用生成式AI的交互系统,帮助营销专业人士通过服装生成、妆容迁移等模块化服务高效创建高质量虚拟关键意见领袖图像,从而降低合作成本并优化营销内容生产流程。
English: GenKOL is an interactive system that uses generative AI to help marketing professionals efficiently create high-quality virtual Key Opinion Leader images, reducing costs and streamlining content production through modular services like garment generation and makeup transfer.

Authors:Weimin Wu, Xuefeng Song, Yibo Wen, Qinjie Lin, Zhihan Zhou, Jerry Yao-Chieh Hu, Zhong Wang, Han Liu
Title: Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models
Abstract:
We introduce Genome-Factory, an integrated Python library for tuning, deploying, and interpreting genomic models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control, such as GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to interpretable genomic features by regressing on external readouts. To improve accessibility, Genome-Factory features both a zero-code command-line interface and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its end-to-end usability and practical value for real-world genomic analysis.
中文: Genome-Factory是一个集成式Python库,通过自动化数据处理、多样化调优方法、高效推理、灵活基准测试及首创的可解释性模块,简化了基因组模型的开发流程,其端到端实用性已在真实基因组分析中得到验证。
English: Genome-Factory is a comprehensive Python library that streamlines the development of genomic models through automated data handling, flexible tuning methods, efficient inference, robust benchmarking, and innovative interpretability tools, validated for end-to-end usability in genomic analysis.

Authors:Xuewei Feng, Zhaoxi Li, Qi Li, Ziqiang Wang, Kun Sun, Ke Xu
Title: Off-Path TCP Exploits: PMTUD Breaks TCP Connection Isolation in IP Address Sharing Scenarios
Abstract:
Path MTU Discovery (PMTUD) and IP address sharing are integral aspects of modern Internet infrastructure. In this paper, we investigate the security vulnerabilities associated with PMTUD within the context of prevalent IP address sharing practices. We reveal that PMTUD is inadequately designed to handle IP address sharing, creating vulnerabilities that attackers can exploit to perform off-path TCP hijacking attacks. We demonstrate that by observing the path MTU value determined by a server for a public IP address (shared among multiple devices), an off-path attacker on the Internet, in collaboration with a malicious device, can infer the sequence numbers of TCP connections established by other legitimate devices sharing the same IP address. This vulnerability enables the attacker to perform off-path TCP hijacking attacks, significantly compromising the security of the affected TCP connections. Our attack involves first identifying a target TCP connection originating from the shared IP address, followed by inferring the sequence numbers of the identified connection. We thoroughly assess the impacts of our attack under various network configurations. Experimental results reveal that the attack can be executed within an average time of 220 seconds, achieving a success rate of 70%.Case studies, including SSH DoS, FTP traffic poisoning, and HTTP injection, highlight the threat it poses to various applications. Additionally, we evaluate our attack across 50 real-world networks with IP address sharing--including public Wi-Fi, VPNs, and 5G--and find 38 vulnerable. Finally, we responsibly disclose the vulnerabilities, receive recognition from organizations such as IETF, Linux, and Cisco, and propose our countermeasures.
中文: 本文揭示了路径MTU发现(PMTUD)在IP地址共享环境中的安全漏洞,攻击者可通过推断TCP序列号实施离径劫持攻击,实验显示平均220秒内达到70%成功率,38个真实网络存在风险。
English: This paper exposes security vulnerabilities in Path MTU Discovery (PMTUD) when combined with IP address sharing, enabling off-path TCP hijacking attacks that can infer sequence numbers and compromise connections with 70% success in 220 seconds on average.

Authors:Wentao Gao, Jiuyong Li, Lin Liu, Thuc Duy Le, Xiongren Chen, Xiaojing Du, Jixue Liu, Yanchang Zhao, Yun Chen
Title: From Noise to Precision: A Diffusion-Driven Approach to Zero-Inflated Precipitation Prediction
Abstract:
Zero-inflated data pose significant challenges in precipitation forecasting due to the predominance of zeros with sparse non-zero events. To address this, we propose the Zero Inflation Diffusion Framework (ZIDF), which integrates Gaussian perturbation for smoothing zero-inflated distributions, Transformer-based prediction for capturing temporal patterns, and diffusion-based denoising to restore the original data structure. In our experiments, we use observational precipitation data collected from South Australia along with synthetically generated zero-inflated data. Results show that ZIDF demonstrates significant performance improvements over multiple state-of-the-art precipitation forecasting models, achieving up to 56.7\% reduction in MSE and 21.1\% reduction in MAE relative to the baseline Non-stationary Transformer. These findings highlight ZIDF's ability to robustly handle sparse time series data and suggest its potential generalizability to other domains where zero inflation is a key challenge.
中文: 零膨胀扩散框架(ZIDF)通过高斯平滑、Transformer预测和扩散去噪技术有效解决了降水预测中的零膨胀数据难题,实现了高达56.7%的均方误差降低,并展现出在相关领域的广泛应用潜力。
English: The Zero Inflation Diffusion Framework (ZIDF) effectively addresses zero-inflated data challenges in precipitation forecasting by combining Gaussian smoothing, Transformer-based prediction, and diffusion denoising, achieving up to 56.7% MSE reduction and demonstrating strong potential for broader applications.

Authors:Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu
Title: Unsupervised Hallucination Detection by Inspecting Reasoning Processes
Abstract:
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtain its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with few training data, making it suitable for real-time detection.
中文: IRIS是一种无监督框架,通过利用大语言模型的内部表征和响应不确定性来检测幻觉内容,无需标注数据即可超越现有方法。
English: IRIS is an unsupervised framework that detects hallucinations in LLM outputs by leveraging the model's internal representations and response uncertainty, outperforming existing methods without requiring labeled data.

Authors:Yanru Huo, Ziyue Jiang, Zuoli Tang, Qingyang Hong, Zhou Zhao
Title: DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
Abstract:
While Diffusion Transformers (DiT) have advanced non-autoregressive (NAR) speech synthesis, their high computational demands remain an limitation. Existing DiT-based text-to-speech (TTS) model acceleration approaches mainly focus on reducing sampling steps through distillation techniques, yet they remain constrained by training costs. We introduce DiTReducio, a training-free acceleration framework that compresses computations in DiT-based TTS models via progressive calibration. We propose two compression methods, Temporal Skipping and Branch Skipping, to eliminate redundant computations during inference. Moreover, based on two characteristic attention patterns identified within DiT layers, we devise a pattern-guided strategy to selectively apply the compression methods. Our method allows flexible modulation between generation quality and computational efficiency through adjustable compression thresholds. Experimental evaluations conducted on F5-TTS and MegaTTS 3 demonstrate that DiTReducio achieves a 75.4% reduction in FLOPs and improves the Real-Time Factor (RTF) by 37.1%, while preserving generation quality.
中文摘要:DiTReducio是一种无需训练的加速框架,通过时间跳跃和分支跳跃方法逐步压缩基于DiT的语音合成模型的计算量,在保持生成质量的同时显著提升了计算效率。
English Summary: DiTReducio is a training-free framework that accelerates DiT-based TTS models by progressively compressing computations through temporal and branch skipping methods, achieving significant efficiency gains while maintaining speech quality.

Authors:Wenhao Li, Bangcheng Sun, Weihao Ye, Tianyi Zhang, Daohai Yu, Fei Chao, Rongrong Ji
Title: CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling
Abstract:
Scaling language models to longer contexts is essential for capturing rich dependencies across extended discourse. However, naïve context extension imposes significant computational and memory burdens, often resulting in inefficiencies during both training and inference. In this work, we propose CCF, a novel context compression framework designed to enable efficient long-context modeling by learning hierarchical latent representations that preserve global semantics while aggressively reducing input redundancy. CCF integrates segment-wise semantic aggregation with key-value memory encoding, forming compact representations that support accurate reconstruction and long-range understanding. To further enhance scalability, we introduce a training-efficient optimization strategy that couples incremental segment decoding with sparse reservoir sampling, substantially reducing memory overhead without degrading performance. Empirical results on multiple long-context language modeling benchmarks demonstrate that CCF achieves competitive perplexity under high compression ratios, and significantly improves throughput and memory efficiency compared to existing approaches. These findings highlight the potential of structured compression for scalable and effective long-context language modeling.
中文摘要:CCF是一种新颖的上下文压缩框架,通过分层潜在表征和记忆编码实现高效长上下文建模,在保持竞争力的性能同时显著提升了吞吐量和内存效率。
English Summary: CCF is a novel context compression framework that enables efficient long-context modeling through hierarchical latent representations and memory encoding, achieving competitive performance with improved throughput and memory efficiency.

Authors:Gagan Mundada, Yash Vishe, Amit Namburi, Xin Xu, Zachary Novack, Julian McAuley, Junda Wu
Title: WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across various vision-language tasks. However, their reasoning abilities in the multimodal symbolic music domain remain largely unexplored. We introduce WildScore, the first in-the-wild multimodal symbolic music reasoning and analysis benchmark, designed to evaluate MLLMs' capacity to interpret real-world music scores and answer complex musicological queries. Each instance in WildScore is sourced from genuine musical compositions and accompanied by authentic user-generated questions and discussions, capturing the intricacies of practical music analysis. To facilitate systematic evaluation, we propose a systematic taxonomy, comprising both high-level and fine-grained musicological ontologies. Furthermore, we frame complex music reasoning as multiple-choice question answering, enabling controlled and scalable assessment of MLLMs' symbolic music understanding. Empirical benchmarking of state-of-the-art MLLMs on WildScore reveals intriguing patterns in their visual-symbolic reasoning, uncovering both promising directions and persistent challenges for MLLMs in symbolic music reasoning and analysis. We release the dataset and code.
中文: WildScore是首个评估多模态大语言模型在真实乐谱和音乐学问题上符号音乐推理能力的基准,揭示了该领域的能力与挑战。
English: WildScore is the first multimodal benchmark for evaluating MLLMs' symbolic music reasoning using real-world scores and musicological queries, revealing both capabilities and challenges in this domain.

Authors:Pengyue Jia, Yingyi Zhang, Xiangyu Zhao, Yixuan Li
Title: GeoArena: An Open Platform for Benchmarking Large Vision-language Models on WorldWide Image Geolocalization
Abstract:
Image geolocalization aims to predict the geographic location of images captured anywhere on Earth, but its global nature presents significant challenges. Current evaluation methodologies suffer from two major limitations. First, data leakage: advanced approaches often rely on large vision-language models (LVLMs) to predict image locations, yet these models are frequently pretrained on the test datasets, compromising the accuracy of evaluating a model's actual geolocalization capability. Second, existing metrics primarily rely on exact geographic coordinates to assess predictions, which not only neglects the reasoning process but also raises privacy concerns when user-level location data is required. To address these issues, we propose GeoArena, a first open platform for evaluating LVLMs on worldwide image geolocalization tasks, offering true in-the-wild and human-centered benchmarking. GeoArena enables users to upload in-the-wild images for a more diverse evaluation corpus, and it leverages pairwise human judgments to determine which model output better aligns with human expectations. Our platform has been deployed online for two months, during which we collected over thousands voting records. Based on this data, we conduct a detailed analysis and establish a leaderboard of different LVLMs on the image geolocalization task.
中文摘要:图像地理定位面临数据泄露和坐标依赖的评估局限,为此我们开发了GeoArena平台,通过真实场景图像和人工评判实现更精准、保护隐私的模型评估。
English Summary: Image geolocalization faces evaluation challenges due to data leakage from pretrained models and overreliance on exact coordinates, prompting the creation of GeoArena, an open platform that uses real-world images and human judgments for more accurate and privacy-conscious benchmarking.

Authors:Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, Shuicheng Yan
Title: AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?
Abstract:
Large Language Model (LLM)-based agentic systems, often comprising multiple models, complex tool invocations, and orchestration protocols, substantially outperform monolithic agents. Yet this very sophistication amplifies their fragility, making them more prone to system failure. Pinpointing the specific agent or step responsible for an error within long execution traces defines the task of agentic system failure attribution. Current state-of-the-art reasoning LLMs, however, remain strikingly inadequate for this challenge, with accuracy generally below 10%. To address this gap, we propose AgenTracer, the first automated framework for annotating failed multi-agent trajectories via counterfactual replay and programmed fault injection, producing the curated dataset TracerTraj. Leveraging this resource, we develop AgenTracer-8B, a lightweight failure tracer trained with multi-granular reinforcement learning, capable of efficiently diagnosing errors in verbose multi-agent interactions. On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%, setting a new standard in LLM agentic failure attribution. More importantly, AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains, empowering self-correcting and self-evolving agentic AI.
中文摘要:基于大语言模型(LLM)的智能体系统虽优于单体智能体,但易因复杂性导致系统故障,现有LLM对此诊断准确率极低;为此提出的AgenTracer自动化框架通过生成标注数据集和轻量级故障追踪器,显著提升了错误归因能力并增强了系统性能。
English Summary: Large Language Model (LLM)-based agentic systems, while outperforming monolithic agents, are prone to fragility and failures, which current LLMs struggle to diagnose with low accuracy; to address this, AgenTracer is introduced as an automated framework that creates a curated dataset and a lightweight failure tracer, significantly improving error attribution and system performance.

Authors:Xiaoxiao Xu, Hao Wu, Wenhui Yu, Lantao Hu, Peng Jiang, Kun Gai
Title: Enhancing Interpretability and Effectiveness in Recommendation with Numerical Features via Learning to Contrast the Counterfactual samples
Abstract:
We propose a general model-agnostic Contrastive learning framework with Counterfactual Samples Synthesizing (CCSS) for modeling the monotonicity between the neural network output and numerical features which is critical for interpretability and effectiveness of recommender systems. CCSS models the monotonicity via a two-stage process: synthesizing counterfactual samples and contrasting the counterfactual samples. The two techniques are naturally integrated into a model-agnostic framework, forming an end-to-end training process. Abundant empirical tests are conducted on a publicly available dataset and a real industrial dataset, and the results well demonstrate the effectiveness of our proposed CCSS. Besides, CCSS has been deployed in our real large-scale industrial recommender, successfully serving over hundreds of millions users.
中文: CCSS框架提出了一种模型无关的对比学习方法,通过合成反事实样本来建模神经网络输出与数值特征之间的单调性,从而提升推荐系统的可解释性和有效性,已在公开数据集和工业场景中验证并成功服务数亿用户。
English: The CCSS framework introduces a model-agnostic contrastive learning approach using counterfactual sample synthesis to ensure monotonic relationships between neural network outputs and numerical features, enhancing recommender system interpretability and effectiveness, with proven success in both public datasets and large-scale industrial deployment serving hundreds of millions of users.

Authors:Sheldon Yu, Yuxin Xiong, Junda Wu, Xintong Li, Tong Yu, Xiang Chen, Ritwik Sinha, Jingbo Shang, Julian McAuley
Title: Explainable Chain-of-Thought Reasoning: An Empirical Analysis on State-Aware Reasoning Dynamics
Abstract:
Recent advances in chain-of-thought (CoT) prompting have enabled large language models (LLMs) to perform multi-step reasoning. However, the explainability of such reasoning remains limited, with prior work primarily focusing on local token-level attribution, such that the high-level semantic roles of reasoning steps and their transitions remain underexplored. In this paper, we introduce a state-aware transition framework that abstracts CoT trajectories into structured latent dynamics. Specifically, to capture the evolving semantics of CoT reasoning, each reasoning step is represented via spectral analysis of token-level embeddings and clustered into semantically coherent latent states. To characterize the global structure of reasoning, we model their progression as a Markov chain, yielding a structured and interpretable view of the reasoning process. This abstraction supports a range of analyses, including semantic role identification, temporal pattern visualization, and consistency evaluation.
中文: 本文提出了一种状态感知转换框架,通过谱分析和马尔可夫链将思维链推理抽象为结构化潜在动态,从而支持对语义角色和推理模式的可解释分析。
English: This paper introduces a state-aware transition framework that abstracts chain-of-thought reasoning into structured latent dynamics using spectral analysis and Markov chains, enabling interpretable analysis of semantic roles and reasoning patterns.

Authors:Emil Javurek, Valentyn Melnychuk, Jonas Schweisthal, Konstantin Hess, Dennis Frauen, Stefan Feuerriegel
Title: An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes
Abstract:
Predicting individualized potential outcomes in sequential decision-making is central for optimizing therapeutic decisions in personalized medicine (e.g., which dosing sequence to give to a cancer patient). However, predicting potential outcomes over long horizons is notoriously difficult. Existing methods that break the curse of the horizon typically lack strong theoretical guarantees such as orthogonality and quasi-oracle efficiency. In this paper, we revisit the problem of predicting individualized potential outcomes in sequential decision-making (i.e., estimating Q-functions in Markov decision processes with observational data) through a causal inference lens. In particular, we develop a comprehensive theoretical foundation for meta-learners in this setting with a focus on beneficial theoretical properties. As a result, we yield a novel meta-learner called DRQ-learner and establish that it is: (1) doubly robust (i.e., valid inference under the misspecification of one of the nuisances), (2) Neyman-orthogonal (i.e., insensitive to first-order estimation errors in the nuisance functions), and (3) achieves quasi-oracle efficiency (i.e., behaves asymptotically as if the ground-truth nuisance functions were known). Our DRQ-learner is applicable to settings with both discrete and continuous state spaces. Further, our DRQ-learner is flexible and can be used together with arbitrary machine learning models (e.g., neural networks). We validate our theoretical results through numerical experiments, thereby showing that our meta-learner outperforms state-of-the-art baselines.
Chinese: 本文提出了一种名为DRQ-learner的新型元学习器,用于序列决策中的个体化潜在结果预测,具备双重稳健性、奈曼正交性和准预言效率,并通过数值实验验证了其优于现有基准方法的性能。
English: This paper introduces the DRQ-learner, a novel meta-learner for predicting individualized potential outcomes in sequential decision-making, which offers doubly robust inference, Neyman-orthogonality, and quasi-oracle efficiency, validated through superior performance in numerical experiments.

Authors:Gang Li, Yulei Qin, Xiaoyu Tan, Dingkang Yang, Yuchen Shi, Zihan Xu, Xiang Li, Xing Sun, Ke Li
Title: RoRecomp: Enhancing Reasoning Efficiency via Rollout Response Recomposition in Reinforcement Learning
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has proven effective in eliciting complex reasoning in large language models (LLMs). However, standard RLVR training often leads to excessively verbose processes (in reasoning tasks) and inefficient exploration trajectories (in agentic settings), as outcome-only rewards provide no incentive for efficiency and the high variance in response length within relatively small rollout groups results in noisy optimization signals. To address this, we propose Rollout Response Recomposition (RoRecomp), a plug-and-play method that guides models toward concise reasoning by strategically recomposing the training data. RoRecomp separates responses into two distinct batch types: 1) priority batches, which combine short-correct and long-incorrect responses selected from online batches to provide a clear gradient signal for brevity, and 2) compensation batches, which utilize remaining responses from a replay buffer to maintain stability and prevent model collapse. To comprehensively evaluate effectiveness, we test RoRecomp across three settings where results demonstrate substantial efficiency gains: reducing reasoning length by 27.7% in zero RL training, reducing unnecessary tool calls by 46.8% while improving accuracy in agentic RL, and achieving up to 52.5% length reduction in thinking compression, all with minimal performance impact.
Chinese: 提出的RoRecomp方法通过将训练数据策略性重组成优先级和补偿批次,有效提升了可验证奖励强化学习的效率,在推理长度和工具调用方面实现了显著减少,且对性能影响极小。
English: The proposed RoRecomp method effectively enhances efficiency in reinforcement learning with verifiable rewards by strategically recomposing training data into priority and compensation batches, achieving significant reductions in reasoning length and tool usage with minimal performance impact.

Authors:Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel
Title: Overlap-Adaptive Regularization for Conditional Average Treatment Effect Estimation
Abstract:
The conditional average treatment effect (CATE) is widely used in personalized medicine to inform therapeutic decisions. However, state-of-the-art methods for CATE estimation (so-called meta-learners) often perform poorly in the presence of low overlap. In this work, we introduce a new approach to tackle this issue and improve the performance of existing meta-learners in the low-overlap regions. Specifically, we introduce Overlap-Adaptive Regularization (OAR) that regularizes target models proportionally to overlap weights so that, informally, the regularization is higher in regions with low overlap. To the best of our knowledge, our OAR is the first approach to leverage overlap weights in the regularization terms of the meta-learners. Our OAR approach is flexible and works with any existing CATE meta-learner: we demonstrate how OAR can be applied to both parametric and non-parametric second-stage models. Furthermore, we propose debiased versions of our OAR that preserve the Neyman-orthogonality of existing meta-learners and thus ensure more robust inference. Through a series of (semi-)synthetic experiments, we demonstrate that our OAR significantly improves CATE estimation in low-overlap settings in comparison to constant regularization.
中文: 本文提出重叠自适应正则化方法,通过根据重叠权重调整正则化强度来改进条件平均处理效应估计,在低重叠区域显著优于传统恒定正则化方法。
English: This paper introduces Overlap-Adaptive Regularization (OAR), a novel method that enhances conditional average treatment effect estimation by applying stronger regularization in low-overlap regions, significantly improving performance compared to constant regularization.

Authors:Zheyuan Hu, Chieh-Hsin Lai, Yuki Mitsufuji, Stefano Ermon
Title: CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models
Abstract:
Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state of the art two step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time, compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.
中文: 本文提出的"一致性中期训练"(CMT)方法通过在预训练和最终流映射训练之间插入轻量级中间阶段,实现了稳定初始化,能以显著减少的计算资源高效生成高质量少步图像。
English: The paper introduces Consistency Mid-Training (CMT), a lightweight intermediate stage between pre-training and final flow map training that stabilizes initialization and enables efficient, high-quality few-step image generation with significantly reduced computational resources.

Authors:Wenjie Wei, Malu Zhang, Jieyuan Zhang, Ammar Belatreche, Shuai Wang, Yimeng Shan, Hanwen Liu, Honglin Cao, Guoqing Wang, Yang Yang, Haizhou Li
Title: S$^2$NN: Sub-bit Spiking Neural Networks
Abstract:
Spiking Neural Networks (SNNs) offer an energy-efficient paradigm for machine intelligence, but their continued scaling poses challenges for resource-limited deployment. Despite recent advances in binary SNNs, the storage and computational demands remain substantial for large-scale networks. To further explore the compression and acceleration potential of SNNs, we propose Sub-bit Spiking Neural Networks (S$^2$NNs) that represent weights with less than one bit. Specifically, we first establish an S$^2$NN baseline by leveraging the clustering patterns of kernels in well-trained binary SNNs. This baseline is highly efficient but suffers from \textit{outlier-induced codeword selection bias} during training. To mitigate this issue, we propose an \textit{outlier-aware sub-bit weight quantization} (OS-Quant) method, which optimizes codeword selection by identifying and adaptively scaling outliers. Furthermore, we propose a \textit{membrane potential-based feature distillation} (MPFD) method, improving the performance of highly compressed S$^2$NN via more precise guidance from a teacher model. Extensive results on vision and non-vision tasks reveal that S$^2$NN outperforms existing quantized SNNs in both performance and efficiency, making it promising for edge computing applications.
中文摘要:本文提出亚比特脉冲神经网络(S²NN),通过低于1比特的权重表示结合异常感知量化技术和膜电位特征蒸馏方法,有效解决了大规模网络部署的资源限制问题,在边缘计算应用中实现了性能与效率的双重突破。
English Summary: The paper introduces Sub-bit Spiking Neural Networks (S²NNs) that represent weights with less than one bit, addressing storage and computational challenges through outlier-aware quantization and membrane potential-based distillation, achieving superior efficiency and performance for edge computing applications.

Authors:Wei Wang, Dong-Dong Wu, Ming Li, Jingxiong Zhang, Gang Niu, Masashi Sugiyama
Title: Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms
Abstract:
Positive-unlabeled (PU) learning is a weakly supervised binary classification problem, in which the goal is to learn a binary classifier from only positive and unlabeled data, without access to negative data. In recent years, many PU learning algorithms have been developed to improve model performance. However, experimental settings are highly inconsistent, making it difficult to identify which algorithm performs better. In this paper, we propose the first PU learning benchmark to systematically compare PU learning algorithms. During our implementation, we identify subtle yet critical factors that affect the realistic and fair evaluation of PU learning algorithms. On the one hand, many PU learning algorithms rely on a validation set that includes negative data for model selection. This is unrealistic in traditional PU learning settings, where no negative data are available. To handle this problem, we systematically investigate model selection criteria for PU learning. On the other hand, the problem settings and solutions of PU learning have different families, i.e., the one-sample and two-sample settings. However, existing evaluation protocols are heavily biased towards the one-sample setting and neglect the significant difference between them. We identify the internal label shift problem of unlabeled training data for the one-sample setting and propose a simple yet effective calibration approach to ensure fair comparisons within and across families. We hope our framework will provide an accessible, realistic, and fair environment for evaluating PU learning algorithms in the future.
中文: 本文提出了首个正例-无标签(PU)学习基准,旨在解决实验设置不一致的问题,并识别了影响公平评估的关键因素,如不现实的模型选择和偏向性评估协议,提出了确保算法公平比较的解决方案。
English: This paper introduces the first benchmark for positive-unlabeled (PU) learning to address inconsistent experimental settings and identifies critical factors like unrealistic model selection and biased evaluation protocols, proposing solutions for fair algorithm comparisons.

Authors:Atakan Topaloglu, Kunyi Li, Michael Niemeyer, Nassir Navab, A. Murat Tekalp, Federico Tombari
Title: OracleGS: Grounding Generative Priors for Sparse-View Gaussian Splatting
Abstract:
Sparse-view novel view synthesis is fundamentally ill-posed due to severe geometric ambiguity. Current methods are caught in a trade-off: regressive models are geometrically faithful but incomplete, whereas generative models can complete scenes but often introduce structural inconsistencies. We propose OracleGS, a novel framework that reconciles generative completeness with regressive fidelity for sparse view Gaussian Splatting. Instead of using generative models to patch incomplete reconstructions, our "propose-and-validate" framework first leverages a pre-trained 3D-aware diffusion model to synthesize novel views to propose a complete scene. We then repurpose a multi-view stereo (MVS) model as a 3D-aware oracle to validate the 3D uncertainties of generated views, using its attention maps to reveal regions where the generated views are well-supported by multi-view evidence versus where they fall into regions of high uncertainty due to occlusion, lack of texture, or direct inconsistency. This uncertainty signal directly guides the optimization of a 3D Gaussian Splatting model via an uncertainty-weighted loss. Our approach conditions the powerful generative prior on multi-view geometric evidence, filtering hallucinatory artifacts while preserving plausible completions in under-constrained regions, outperforming state-of-the-art methods on datasets including Mip-NeRF 360 and NeRF Synthetic.
Chinese Summary: OracleGS是一种新颖框架,通过扩散模型生成完整场景并利用多视角立体模型验证几何不确定性,将生成完整性与回归保真度相结合,指导3D高斯溅射优化,在稀疏视角新视图合成中超越了现有最优方法。
English Summary: OracleGS is a novel framework that integrates generative completeness with regressive fidelity for sparse-view novel view synthesis by using a diffusion model to propose complete scenes and a multi-view stereo model to validate geometric uncertainties, guiding 3D Gaussian Splatting optimization to outperform state-of-the-art methods.

Authors:Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
Title: Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Abstract:
Reinforcement learning (RL) is the dominant paradigm for sharpening strategic tool use capabilities of LLMs on long-horizon, sparsely-rewarded agent tasks, yet it faces a fundamental challenge of exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to the multi-turn distribution shifting. In this paper, we target the progressive exploration-exploitation balance under the guidance of the agent own experiences without succumbing to either entropy collapsing or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy update, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation gets strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address the potential policy drift. Reugularizations such as the clipping of tokens with high covariance between probability and advantage are introduced to the trajectory-level entropy control to curb over-confidence.
中文摘要:强化学习在训练智能大语言模型时面临探索与利用的平衡难题,SPEAR方法通过课程式自模仿学习框架,分阶段调控策略熵值,结合内在奖励和经验重校准,实现渐进式探索优化与训练稳定。
English Summary: Reinforcement learning (RL) faces exploration-exploitation challenges in training agentic LLMs, which SPEAR addresses through a curriculum-based self-imitation learning approach that progressively balances entropy and stabilizes training with intrinsic rewards and experience recalibration.

Authors:Baijun Cheng, Kailong Wang, Ling Shi, Haoyu Wang, Peng Di, Yao Guo, Ding Li, Xiangqun Chen
Title: Boosting Pointer Analysis With Large Language Model-Enhanced Allocation Function Detection
Abstract:
Pointer analysis is foundational for many static analysis tasks, yet its effectiveness is often hindered by imprecise modeling of heap allocations, particularly in C/C++ programs where user-defined allocation functions (AFs) are pervasive. Existing approaches largely overlook these custom allocators, leading to coarse aliasing and reduced analysis precision. In this paper, we present AFD, a novel technique that enhances pointer analysis by automatically identifying and modeling custom allocation functions. AFD employs a hybrid approach: it uses value-flow analysis to detect straightforward wrappers and leverages Large Language Models (LLMs) to reason about more complex allocation patterns with side effects. This targeted enhancement enables precise modeling of heap objects at each call site, achieving context-sensitivity-like benefits without the associated overhead. We evaluate AFD on 15 real-world C projects, identifying over 600 custom AFs. Integrating AFD into a baseline pointer analysis yields a 26x increase in modeled heap objects and a 39% reduction in alias set sizes, with only 1.4x runtime overhead. Furthermore, our enhanced analysis improves indirect call resolution and uncovers 17 previously undetected memory bugs. These results demonstrate that precise modeling of custom allocation functions offers a scalable and practical path to improving pointer analysis in large software systems.
中文: AFD通过结合值流分析和大型语言模型的混合方法,精确建模自定义分配函数,从而显著提升堆对象建模和别名分析精度,且仅带来轻微运行时开销。
English: AFD enhances pointer analysis by precisely modeling custom allocation functions through a hybrid approach of value-flow analysis and LLMs, significantly improving heap object modeling and alias precision with minimal runtime overhead.

Authors:Xuechen Liu, Xin Wang, Junichi Yamagishi
Title: Frustratingly Easy Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching
Abstract:
Modern audio deepfake detectors using foundation models and large training datasets have achieved promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods that models have not seen from reigning training data. Conventional approaches against such attacks require fine-tuning the detectors, which can be problematic when prompt response is required. This study introduces a training-free framework for zero-day audio deepfake detection based on knowledge representations, retrieval augmentation, and voice profile matching. Based on the framework, we propose simple yet effective knowledge retrieval and ensemble methods that achieve performance comparable to fine-tuned models on DeepFake-Eval-2024, without any additional model-wise training. We also conduct ablation studies on retrieval pool size and voice profile attributes, validating their relevance to the system efficacy.
中文摘要:本研究提出了一种无需训练的零样本音频深度伪造检测框架,通过知识表示和检索方法实现了与微调模型相当的性能,无需额外训练。
English Summary: This study introduces a training-free framework for detecting zero-day audio deepfakes using knowledge representations and retrieval methods, achieving performance comparable to fine-tuned models without additional training.

Authors:Ziang Luo, Kangan Qian, Jiahua Wang, Yuechen Luo, Jinyu Miao, Zheng Fu, Yunlong Wang, Sicong Jiang, Zilin Huang, Yifei Hu, Yuhao Yang, Hao Ye, Mengmeng Yang, Xiaojian Dong, Kun Jiang, Diange Yang
Title: MTRDrive: Memory-Tool Synergistic Reasoning for Robust Autonomous Driving in Corner Cases
Abstract:
Vision-Language Models(VLMs) have demonstrated significant potential for end-to-end autonomous driving, yet a substantial gap remains between their current capabilities and the reliability necessary for real-world deployment. A critical challenge is their fragility, characterized by hallucinations and poor generalization in out-of-distribution (OOD) scenarios. To bridge this gap, we introduce MTRDrive, a novel framework that integrates procedural driving experiences with a dynamic toolkit to enhance generalization and proactive decision-making. MTRDrive addresses these limitations through a closed-loop system that combines a memory-based experience retrieval mechanism with dynamic toolkits. This synergy enables the model to interact more effectively with its environment, improving both reasoning and decision-making capabilities with the help of our memory-tool synergistic reasoning. Additionally, we introduce a new benchmark based on complex Roadwork construction scenarios to rigorously evaluate zero-shot generalization. Extensive experiments demonstrate the superior effectiveness of our approach. On the public NAVSIM benchmark, our 3B-parameter MTRDrive model achieves an exceptional PDMS of 88.3 without chain-of-thought and sets a state-of-the-art performance bar on high-level planning, with a driving metric score of 79.8\% and a planning accuracy of 82.6\%. Rigorous zero-shot evaluation on the new Roadwork-VLM benchmark shows a strong ability to reason robustly in unseen scenarios, achieving a driving metric score of 80.2\%. These results highlight MTRDrive's potential to advance autonomous driving toward safer and more reliable systems.
Chinese Summary: 视觉语言模型在自动驾驶中潜力显著但可靠性不足,MTRDrive框架通过记忆与工具协同机制提升泛化与决策能力,在多项基准测试中实现了最先进的性能表现。
English Summary: Vision-Language Models show promise for autonomous driving but face reliability gaps, which MTRDrive addresses through a memory-tool synergy framework that enhances generalization and decision-making, achieving state-of-the-art performance in benchmarks.

Authors:Shuo Huang, Xingliang Yuan, Gholamreza Haffari, Lizhen Qu
Title: Zero-Shot Privacy-Aware Text Rewriting via Iterative Tree Search
Abstract:
The increasing adoption of large language models (LLMs) in cloud-based services has raised significant privacy concerns, as user inputs may inadvertently expose sensitive information. Existing text anonymization and de-identification techniques, such as rule-based redaction and scrubbing, often struggle to balance privacy preservation with text naturalness and utility. In this work, we propose a zero-shot, tree-search-based iterative sentence rewriting algorithm that systematically obfuscates or deletes private information while preserving coherence, relevance, and naturalness. Our method incrementally rewrites privacy-sensitive segments through a structured search guided by a reward model, enabling dynamic exploration of the rewriting space. Experiments on privacy-sensitive datasets show that our approach significantly outperforms existing baselines, achieving a superior balance between privacy protection and utility preservation.
中文: 本文提出了一种基于树搜索的零样本迭代重写算法,能有效隐藏文本中的隐私信息同时保持连贯性和实用性,在隐私保护与文本效用平衡方面显著优于现有方法。
English: This paper introduces a zero-shot tree-search-based iterative rewriting algorithm that effectively conceals private information in text while maintaining coherence and utility, outperforming existing methods in balancing privacy and naturalness.

Authors:Yandan Yang, Baoxiong Jia, Shujie Zhang, Siyuan Huang
Title: SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
Abstract:
Indoor scene synthesis has become increasingly important with the rise of Embodied AI, which requires 3D environments that are not only visually realistic but also physically plausible and functionally diverse. While recent approaches have advanced visual fidelity, they often remain constrained to fixed scene categories, lack sufficient object-level detail and physical consistency, and struggle to align with complex user instructions. In this work, we present SceneWeaver, a reflective agentic framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. At its core, SceneWeaver employs a language model-based planner to select from a suite of extensible scene generation tools, ranging from data-driven generative models to visual- and LLM-based methods, guided by self-evaluation of physical plausibility, visual realism, and semantic alignment with user input. This closed-loop reason-act-reflect design enables the agent to identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. Extensive experiments on both common and open-vocabulary room types demonstrate that SceneWeaver not only outperforms prior methods on physical, visual, and semantic metrics, but also generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation. Project website: https://scene-weaver.github.io/.
中文: SceneWeaver提出了一种反思性智能体框架,通过迭代优化整合多样化的场景合成工具,在物理合理性、视觉真实性和语义对齐方面超越现有方法,并能泛化至复杂场景。
English: SceneWeaver introduces a reflective agentic framework that unifies diverse scene synthesis tools through iterative refinement, outperforming prior methods in physical plausibility, visual realism, and semantic alignment while generalizing to complex scenes.

Authors:Suqing Wang, Zuchao Li, Luohe Shi, Bo Du, Hai Zhao, Yun Li, Qianren Wang
Title: From Parameters to Performance: A Data-Driven Study on LLM Structure and Development
Abstract:
Large language models (LLMs) have achieved remarkable success across various domains, driving significant technological advancements and innovations. Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce. To address this gap, we present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks. Leveraging this dataset, we conduct a systematic, data mining-driven analysis to validate and quantify the relationship between structural configurations and performance. Our study begins with a review of the historical development of LLMs and an exploration of potential future trends. We then analyze how various structural choices impact performance across benchmarks and further corroborate our findings using mechanistic interpretability techniques. By providing data-driven insights into LLM optimization, our work aims to guide the targeted development and application of future models. We will release our dataset at https://huggingface.co/datasets/DX0369/LLM-Structure-Performance-Dataset
中文: 本研究通过大规模数据集和系统性分析,量化了结构配置对大型语言模型性能的影响,为未来模型的优化开发提供了数据驱动的指导。
English: This study introduces a large-scale dataset and conducts a systematic analysis to quantify how structural configurations impact LLM performance, providing data-driven insights for optimizing future model development.

Authors:Yu Liu, Baoxiong Jia, Ruijie Lu, Chuyue Gan, Huayu Chen, Junfeng Ni, Song-Chun Zhu, Siyuan Huang
Title: VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video
Abstract:
Building digital twins of articulated objects from monocular video presents an essential challenge in computer vision, which requires simultaneous reconstruction of object geometry, part segmentation, and articulation parameters from limited viewpoint inputs. Monocular video offers an attractive input format due to its simplicity and scalability; however, it's challenging to disentangle the object geometry and part dynamics with visual supervision alone, as the joint movement of the camera and parts leads to ill-posed estimation. While motion priors from pre-trained tracking models can alleviate the issue, how to effectively integrate them for articulation learning remains largely unexplored. To address this problem, we introduce VideoArtGS, a novel approach that reconstructs high-fidelity digital twins of articulated objects from monocular video. We propose a motion prior guidance pipeline that analyzes 3D tracks, filters noise, and provides reliable initialization of articulation parameters. We also design a hybrid center-grid part assignment module for articulation-based deformation fields that captures accurate part motion. VideoArtGS demonstrates state-of-the-art performance in articulation and mesh reconstruction, reducing the reconstruction error by about two orders of magnitude compared to existing methods. VideoArtGS enables practical digital twin creation from monocular video, establishing a new benchmark for video-based articulated object reconstruction. Our work is made publicly available at: https://videoartgs.github.io.
中文摘要:VideoArtGS提出了一种从单目视频构建关节物体高保真数字孪生的创新方法,通过整合运动先验和混合部件分配模块,在关节重建和网格恢复方面达到最先进水平,将重建误差降低了约两个数量级。
English Summary: VideoArtGS introduces a novel method for creating high-fidelity digital twins of articulated objects from monocular video by integrating motion priors and hybrid part assignment, achieving state-of-the-art reconstruction accuracy with a significant reduction in error.

Authors:Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, Haizhou Li
Title: Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling
Abstract:
The explosive growth in sequence length has intensified the demand for effective and efficient long sequence modeling. Benefiting from intrinsic oscillatory membrane dynamics, Resonate-and-Fire (RF) neurons can efficiently extract frequency components from input signals and encode them into spatiotemporal spike trains, making them well-suited for long sequence modeling. However, RF neurons exhibit limited effective memory capacity and a trade-off between energy efficiency and training speed on complex temporal tasks. Inspired by the dendritic structure of biological neurons, we propose a Dendritic Resonate-and-Fire (D-RF) model, which explicitly incorporates a multi-dendritic and soma architecture. Each dendritic branch encodes specific frequency bands by utilizing the intrinsic oscillatory dynamics of RF neurons, thereby collectively achieving comprehensive frequency representation. Furthermore, we introduce an adaptive threshold mechanism into the soma structure that adjusts the threshold based on historical spiking activity, reducing redundant spikes while maintaining training efficiency in long sequence tasks. Extensive experiments demonstrate that our method maintains competitive accuracy while substantially ensuring sparse spikes without compromising computational efficiency during training. These results underscore its potential as an effective and efficient solution for long sequence modeling on edge platforms.
中文摘要:树突谐振发放模型通过多树突频率编码和自适应阈值机制,在保持计算效率的同时以稀疏脉冲实现竞争性精度,为长序列建模提供了高效解决方案。
English Summary: The Dendritic Resonate-and-Fire model enhances long sequence modeling by incorporating multi-dendritic encoding of frequency bands and an adaptive threshold mechanism, achieving competitive accuracy with sparse spikes while maintaining computational efficiency.

Authors:Bo Yin, Xingyi Yang, Xinchao Wang
Title: Don't Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning
Abstract:
Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt weight matrices while keeping activation functions fixed. We introduce \textbf{NoRA}, the first PEFT framework that directly adapts nonlinear activation functions in pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients, with a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4\% of parameters (0.02M), achieving accuracy gains of +0.17\% and +0.27\%. When combined with LoRA (\textbf{NoRA++}), it outperforms LoRA and DoRA under matched training budgets by adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, yielding average MMLU gains of +0.3\%--0.8\%, including +1.6\% on STEM (Alpaca) and +1.3\% on OpenOrca. We further show that NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction. These results establish activation-space tuning as a complementary and highly parameter-efficient alternative to weight-based PEFT, positioning activation functions as first-class objects for model adaptation.
中文: NoRA是首个通过可学习有理函数和结构化低秩更新来调整预训练模型中非线性激活函数的参数高效微调框架,仅需更新0.4%参数即可达到或超越全参数微调效果,并在语言模型中显著提升生成质量。
English: NoRA is the first parameter-efficient fine-tuning framework that adapts nonlinear activation functions in pretrained transformers using learnable rational functions with structured low-rank updates, achieving comparable or superior performance to full fine-tuning while updating only 0.4% parameters and demonstrating enhanced generation quality in language models.

Authors:Leyi Pan, Sheng Guan, Zheyu Fu, Luyang Si, Zian Wang, Xuming Hu, Irwin King, Philip S. Yu, Aiwei Liu, Lijie Wen
Title: MarkDiffusion: An Open-Source Toolkit for Generative Watermarking of Latent Diffusion Models
Abstract:
We introduce MarkDiffusion, an open-source Python toolkit for generative watermarking of latent diffusion models. It comprises three key components: a unified implementation framework for streamlined watermarking algorithm integrations and user-friendly interfaces; a mechanism visualization suite that intuitively showcases added and extracted watermark patterns to aid public understanding; and a comprehensive evaluation module offering standard implementations of 24 tools across three essential aspects - detectability, robustness, and output quality - plus 8 automated evaluation pipelines. Through MarkDiffusion, we seek to assist researchers, enhance public awareness and engagement in generative watermarking, and promote consensus while advancing research and applications.
中文: MarkDiffusion 是一个用于隐扩散模型生成水印的开源 Python 工具包,集成了统一框架、可视化组件和评估模块,旨在促进研究和提升公众认知与参与。
English: MarkDiffusion is an open-source Python toolkit designed for generative watermarking in latent diffusion models, featuring a unified framework, visualization tools, and comprehensive evaluation modules to support research and public engagement.

Authors:Che Liu, Yinda Chen, Haoyuan Shi, Jinpeng Lu, Bailiang Jian, Jiazhen Pan, Linghan Cai, Jiayi Wang, Yundi Zhang, Jun Li, Cosmin I. Bercea, Cheng Ouyang, Chen Chen, Zhiwei Xiong, Benedikt Wiestler, Christian Wachinger, Daniel Rueckert, Wenjia Bai, Rossella Arcucci
Title: Does DINOv3 Set a New Medical Vision Standard?
Abstract:
The advent of large-scale vision foundation models, pre-trained on diverse natural images, has marked a paradigm shift in computer vision. However, how the frontier vision foundation models' efficacies transfer to specialized domains remains such as medical imaging remains an open question. This report investigates whether DINOv3, a state-of-the-art self-supervised vision transformer (ViT) that features strong capability in dense prediction tasks, can directly serve as a powerful, unified encoder for medical vision tasks without domain-specific pre-training. To answer this, we benchmark DINOv3 across common medical vision tasks, including 2D/3D classification and segmentation on a wide range of medical imaging modalities. We systematically analyze its scalability by varying model sizes and input image resolutions. Our findings reveal that DINOv3 shows impressive performance and establishes a formidable new baseline. Remarkably, it can even outperform medical-specific foundation models like BiomedCLIP and CT-Net on several tasks, despite being trained solely on natural images. However, we identify clear limitations: The model's features degrade in scenarios requiring deep domain specialization, such as in Whole-Slide Pathological Images (WSIs), Electron Microscopy (EM), and Positron Emission Tomography (PET). Furthermore, we observe that DINOv3 does not consistently obey scaling law in the medical domain; performance does not reliably increase with larger models or finer feature resolutions, showing diverse scaling behaviors across tasks. Ultimately, our work establishes DINOv3 as a strong baseline, whose powerful visual features can serve as a robust prior for multiple complex medical tasks. This opens promising future directions, such as leveraging its features to enforce multiview consistency in 3D reconstruction.
中文: DINOv3在无需领域特定训练的情况下,作为医学视觉任务的统一编码器表现出色,在某些任务中超越专业模型,但在高度专业化领域存在局限且缩放行为不一致。
English: DINOv3 demonstrates strong performance as a unified encoder for medical vision tasks without domain-specific training, outperforming specialized models in some cases but showing limitations in highly specialized domains and inconsistent scaling behaviors.

Authors:Qianheng Zhang, Song Gao, Chen Wei, Yibo Zhao, Ying Nie, Ziru Chen, Shijie Chen, Yu Su, Huan Sun
Title: GeoAnalystBench: A GeoAI benchmark for assessing large language models for spatial analysis workflow and code generation
Abstract:
Recent advances in large language models (LLMs) have fueled growing interest in automating geospatial analysis and GIS workflows, yet their actual capabilities remain uncertain. In this work, we call for rigorous evaluation of LLMs on well-defined geoprocessing tasks before making claims about full GIS automation. To this end, we present GeoAnalystBench, a benchmark of 50 Python-based tasks derived from real-world geospatial problems and carefully validated by GIS experts. Each task is paired with a minimum deliverable product, and evaluation covers workflow validity, structural alignment, semantic similarity, and code quality (CodeBLEU). Using this benchmark, we assess both proprietary and open source models. Results reveal a clear gap: proprietary models such as ChatGPT-4o-mini achieve high validity 95% and stronger code alignment (CodeBLEU 0.39), while smaller open source models like DeepSeek-R1-7B often generate incomplete or inconsistent workflows (48.5% validity, 0.272 CodeBLEU). Tasks requiring deeper spatial reasoning, such as spatial relationship detection or optimal site selection, remain the most challenging across all models. These findings demonstrate both the promise and limitations of current LLMs in GIS automation and provide a reproducible framework to advance GeoAI research with human-in-the-loop support.
中文: 大语言模型的最新进展激发了地理空间分析自动化的兴趣,但通过GeoAnalystBench基准测试发现,专有模型与开源模型在性能上存在显著差距,尤其在复杂空间推理任务中,凸显了当前模型在地理信息系统自动化中的潜力与局限。
English: Recent advances in large language models have spurred interest in automating geospatial analysis, but their capabilities require rigorous evaluation, as demonstrated by the GeoAnalystBench benchmark, which reveals significant performance gaps between proprietary and open-source models, particularly in complex spatial reasoning tasks.

Authors:JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao
Title: From Editor to Dense Geometry Estimator
Abstract:
Leveraging visual priors from pre-trained text-to-image (T2I) generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than T2I generative models, may be a more suitable foundation for fine-tuning. Motivated by this, we conduct a systematic analysis of the fine-tuning behaviors of both editors and generators for dense geometry estimation. Our findings show that editing models possess inherent structural priors, which enable them to converge more stably by ``refining" their innate features, and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce \textbf{FE2E}, a framework that pioneeringly adapts an advanced editing model based on Diffusion Transformer (DiT) architecture for dense geometry prediction. Specifically, to tailor the editor for this deterministic task, we reformulate the editor's original flow matching loss into the ``consistent velocity" training objective. And we use logarithmic quantization to resolve the precision conflict between the editor's native BFloat16 format and the high precision demand of our tasks. Additionally, we leverage the DiT's global attention for a cost-free joint estimation of depth and normals in a single forward pass, enabling their supervisory signals to mutually enhance each other. Without scaling up the training data, FE2E achieves impressive performance improvements in zero-shot monocular depth and normal estimation across multiple datasets. Notably, it achieves over 35\% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100$\times$ data. The project page can be accessed \href{https://amap-ml.github.io/FE2E/}{here}.
中文: 研究表明,相比文本到图像生成模型,微调图像编辑模型因其固有的结构先验能更有效地进行密集几何估计,由此开发的FE2E框架无需增加训练数据即可实现显著性能提升。
English: The study demonstrates that fine-tuning image editing models, rather than text-to-image generators, yields superior dense geometry estimation due to their inherent structural priors, leading to the development of the FE2E framework that achieves significant performance gains without additional training data.

Authors:Jiaxin Guo, Daimeng Wei, Yuanchang Luo, Xiaoyu Chen, Zhanglin Wu, Huan Yang, Hengchao Shang, Zongyao Li, Zhiqiang Rao, Jinlong Yang, Hao Yang
Title: Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation
Abstract:
Large language models (LLMs) have ushered in a new era for document-level machine translation (\textit{doc}-mt), yet their whole-document outputs challenge existing evaluation methods that assume sentence-by-sentence alignment. We introduce \textit{\textbf{Align-then-Slide}}, a complete evaluation framework for ultra-long doc-mt. In the Align stage, we automatically infer sentence-level source-target correspondences and rebuild the target to match the source sentence number, resolving omissions and many-to-one/one-to-many mappings. In the n-Chunk Sliding Evaluate stage, we calculate averaged metric scores under 1-, 2-, 3- and 4-chunk for multi-granularity assessment. Experiments on the WMT benchmark show a Pearson correlation of 0.929 between our method with expert MQM rankings. On a newly curated real-world test set, our method again aligns closely with human judgments. Furthermore, preference data produced by Align-then-Slide enables effective CPO training and its direct use as a reward model for GRPO, both yielding translations preferred over a vanilla SFT baseline. The results validate our framework as an accurate, robust, and actionable evaluation tool for doc-mt systems.
中文摘要:Align-then-Slide框架通过源语-目标语句子对齐和多分块滑动评估,解决了文档级机器翻译的评估难题,实验证明其与人工评判高度一致,并能有效提升翻译模型的训练效果。
English Summary: The Align-then-Slide framework addresses evaluation challenges in document-level machine translation by aligning source-target sentences and using multi-chunk sliding assessment, demonstrating high correlation with human judgments and enabling improved translation training.

Authors:Carlo Fabrizio, Gianvito Losapio, Marco Mussi, Alberto Maria Metelli, Marcello Restelli
Title: Power Grid Control with Graph-Based Distributed Reinforcement Learning
Abstract:
The necessary integration of renewable energy sources, combined with the expanding scale of power networks, presents significant challenges in controlling modern power grids. Traditional control systems, which are human and optimization-based, struggle to adapt and to scale in such an evolving context, motivating the exploration of more dynamic and distributed control strategies. This work advances a graph-based distributed reinforcement learning framework for real-time, scalable grid management. The proposed architecture consists of a network of distributed low-level agents acting on individual power lines and coordinated by a high-level manager agent. A Graph Neural Network (GNN) is employed to encode the network's topological information within the single low-level agent's observation. To accelerate convergence and enhance learning stability, the framework integrates imitation learning and potential-based reward shaping. In contrast to conventional decentralized approaches that decompose only the action space while relying on global observations, this method also decomposes the observation space. Each low-level agent acts based on a structured and informative local view of the environment constructed through the GNN. Experiments on the Grid2Op simulation environment show the effectiveness of the approach, which consistently outperforms the standard baseline commonly adopted in the field. Additionally, the proposed model proves to be much more computationally efficient than the simulation-based Expert method.
中文: 本研究提出了一种基于图结构的分布式强化学习框架,通过由管理器协调的分布式代理并利用图神经网络进行局部观测,实现了可扩展且高效的实时电网管理,其性能和计算效率均优于传统方法。
English: This study introduces a graph-based distributed reinforcement learning framework for scalable and efficient real-time power grid management, utilizing distributed agents coordinated by a manager and employing Graph Neural Networks for local observations, which outperforms traditional methods in both effectiveness and computational efficiency.

Authors:Zhiwei Zhang, Ruikai Xu, Weijian Zhang, Zhizhong Zhang, Xin Tan, Jingyu Gong, Yuan Xie, Lizhuang Ma
Title: PFDepth: Heterogeneous Pinhole-Fisheye Joint Depth Estimation via Distortion-aware Gaussian-Splatted Volumetric Fusion
Abstract:
In this paper, we present the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, PFDepth. Our key insight is to exploit the complementary characteristics of pinhole and fisheye imagery (undistorted vs. distorted, small vs. large FOV, far vs. near field) for joint optimization. PFDepth employs a unified architecture capable of processing arbitrary combinations of pinhole and fisheye cameras with varied intrinsics and extrinsics. Within PFDepth, we first explicitly lift 2D features from each heterogeneous view into a canonical 3D volumetric space. Then, a core module termed Heterogeneous Spatial Fusion is designed to process and fuse distortion-aware volumetric features across overlapping and non-overlapping regions. Additionally, we subtly reformulate the conventional voxel fusion into a novel 3D Gaussian representation, in which learnable latent Gaussian spheres dynamically adapt to local image textures for finer 3D aggregation. Finally, fused volume features are rendered into multi-view depth maps. Through extensive experiments, we demonstrate that PFDepth sets a state-of-the-art performance on KITTI-360 and RealHet datasets over current mainstream depth networks. To the best of our knowledge, this is the first systematic study of heterogeneous pinhole-fisheye depth estimation, offering both technical novelty and valuable empirical insights.
中文: 本文提出首个针孔-鱼眼异构多视角深度估计框架PFDepth,通过统一架构结合新型3D高斯表示法,充分利用两种摄像机的互补特性,在多个数据集上实现了最先进的性能表现。
English: This paper introduces PFDepth, the first pinhole-fisheye framework for heterogeneous multi-view depth estimation, leveraging complementary characteristics of both camera types through a unified architecture with novel 3D Gaussian representation and achieving state-of-the-art performance.

Authors:Jun Rao, Yunjie Liao, Xuebo Liu, Zepeng Lin, Lian Lian, Dong Jin, Shengjun Cheng, Jun Yu, Min Zhang
Title: SeaPO: Strategic Error Amplification for Robust Preference Optimization of Large Language Models
Abstract:
Existing alignment methods for preference optimization of large language models (LLMs) aim to enhance model performance by utilizing pairs of positive and negative samples. However, due to the limited capacity of models in scoring or generating responses, the quality of positive and negative samples may become similar during training, which complicates optimization for preference learning. To address this issue, we introduce SeaPO, a Strategic Error Amplification method that leverages three error types commonly occurring in LLMs to introduce specific error patterns into the model Preference Optimization. This strategy ensures that negative samples are more erroneous than positive samples and preference-based training is employed to mitigate the occurrence of these errors, thereby enhancing model performance. Evaluations across five capability dimensions and different model scales (1.5B to 14B) demonstrate that the generated data significantly improved overall model performance, particularly in terms of truthfulness, with improvements of 5-10 percentage points observed. Further analysis reveals that task performance varies depending on the error types introduced. Injecting the most common error types improves performance in related tasks, while a mix of error types leads to a broader performance enhancement: most tasks show stable improvements, while a few tasks exhibit significant gains.
中文摘要:提出的SeaPO方法通过在偏好优化中策略性地放大语言模型常见错误,确保负样本比正样本更具错误性,从而显著提升模型在多维能力上的表现,尤其在真实性方面提高了5-10个百分点。
English Summary: The proposed SeaPO method strategically amplifies common errors in LLMs during preference optimization to ensure clearer distinction between positive and negative samples, significantly enhancing model performance across multiple dimensions, particularly truthfulness, by 5-10 percentage points.

Authors:Wenxuan Wang, Yongjiang Wu, Junyuan Zhang, Shuqing Li, Yun Peng, Wenting Chen, Shuai Wang, Michael R. Lyu
Title: Metamorphic Testing for Audio Content Moderation Software
Abstract:
The rapid growth of audio-centric platforms and applications such as WhatsApp and Twitter has transformed the way people communicate and share audio content in modern society. However, these platforms are increasingly misused to disseminate harmful audio content, such as hate speech, deceptive advertisements, and explicit material, which can have significant negative consequences (e.g., detrimental effects on mental health). In response, researchers and practitioners have been actively developing and deploying audio content moderation tools to tackle this issue. Despite these efforts, malicious actors can bypass moderation systems by making subtle alterations to audio content, such as modifying pitch or inserting noise. Moreover, the effectiveness of modern audio moderation tools against such adversarial inputs remains insufficiently studied. To address these challenges, we propose MTAM, a Metamorphic Testing framework for Audio content Moderation software. Specifically, we conduct a pilot study on 2000 audio clips and define 14 metamorphic relations across two perturbation categories: Audio Features-Based and Heuristic perturbations. MTAM applies these metamorphic relations to toxic audio content to generate test cases that remain harmful while being more likely to evade detection. In our evaluation, we employ MTAM to test five commercial textual content moderation software and an academic model against three kinds of toxic content. The results show that MTAM achieves up to 38.6%, 18.3%, 35.1%, 16.7%, and 51.1% error finding rates (EFR) when testing commercial moderation software provided by Gladia, Assembly AI, Baidu, Nextdata, and Tencent, respectively, and it obtains up to 45.7% EFR when testing the state-of-the-art algorithms from the academy.
Chinese: 针对音频平台被滥用于传播有害内容的问题,本研究提出MTAM蜕变测试框架,通过生成对抗性测试用例有效识别音频审核工具的漏洞,在商业系统中最高可实现51.1%的错误发现率。
English: To combat the misuse of audio platforms for harmful content, this study introduces MTAM, a metamorphic testing framework that effectively identifies vulnerabilities in audio moderation tools by generating adversarial test cases, achieving error finding rates of up to 51.1% in commercial systems.

Authors:Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo
Title: Pretraining Scaling Laws for Generative Evaluations of Language Models
Abstract:
Neural scaling laws have played a central role in modern machine learning, driving the field's ever-expanding scaling of parameters, data and compute. While much research has gone into fitting scaling laws and predicting performance on pretraining losses and on discriminative evaluations such as multiple-choice question-answering, comparatively little research has been done on fitting scaling laws and predicting performance on generative evaluations such as mathematical problem-solving or software engineering. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using the performance of cheaper models. Our three scaling laws differ in the covariates used: (1) compute, (2) model parameters and tokens, (3) log likelihoods of gold reference solutions. We make four main contributions: (1) We show how generative evaluations offer new hyperparameters (in our setting, $k$) that researchers can use to control the scaling laws parameters and the predictability of performance. (2) In terms of scaling law parameters, we find that the compute scaling law and parameters\,+\,tokens scaling law stabilize for the last ~$1.5{-}2.5$ orders of magnitude, whereas the gold reference likelihood scaling law stabilizes for the last ~$5$ orders of magnitude. (3) In terms of predictive performance, we find all three scaling laws perform comparably, although the compute scaling law predicts slightly worse for small $k$ and the log likelihoods of gold reference solutions predicts slightly worse for large $k$. (4) We establish a theoretical connection that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens scaling law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.
Chinese: 本研究针对生成式评估提出并评估了三种预训练扩展定律,揭示了它们在性能预测中的有效性,并展示了在不同参数下稳定性与预测精度的差异。
English: This research introduces and assesses three pretraining scaling laws for generative evaluations, demonstrating their effectiveness in predicting performance and highlighting their varying stability and predictive accuracy across different parameters.

Authors:Rylan Schaeffer, Noam Levi, Andreas Kirsch, Theo Guenais, Brando Miranda, Elyas Obbad, Sanmi Koyejo
Title: Evaluating the Robustness of Chinchilla Compute-Optimal Scaling
Abstract:
Hoffman et al (2022)'s Chinchilla paper introduced the principle of compute-optimal scaling, laying a foundation for future scaling of language models. In the years since, however, valid concerns about Chinchilla have been raised: wide confidence intervals, discrepancies between its three approaches, and incongruities with other scaling laws. This raises a critical question for the field: Can practitioners still rely on Chinchilla's prescriptions? Our work demonstrates the answer is yes. We begin by uncovering that the model parameters central to Chinchilla's analyses were ambiguous: three interpretations are possible, with relative differences between different interpretations of model parameters as high as 15.2%. We find that, perhaps surprisingly, which model parameters are used for the analyses do not meaningfully affect key results: the scaling law estimates and the compute-optimal tokens-to-parameter ratio. Indeed, under one interpretation, the tokens-to-parameter ratio becomes more constant with the target compute budget. We then ask how distorted the Chinchilla model parameters could have been without meaningfully affecting the key results. By deliberately perturbing model parameters in four structured ways, we find that key Chinchilla results are most sensitive to additive or systematic errors, which can alter the otherwise flat trend of the optimal tokens-to-parameter ratio, but overall, Chinchilla's key results withstand sizable perturbations. Altogether, our findings offer the field renewed confidence in Chinchilla as a durable guide for scaling language models.
Chinese: Hoffman等人的Chinchilla论文提出了计算最优缩放原则,尽管对其方法存在质疑,本研究证实其作为语言模型缩放的指导具有持久可靠性。
English: Hoffman et al.'s Chinchilla paper established compute-optimal scaling principles, and despite concerns about its methodology, this study confirms its reliability as a durable guide for scaling language models.

Authors:Shaobo Wang, Jiaming Wang, Jiajun Zhang, Cong Wang, Yue Min, Zichen Wen, Fei Huang, Huiqiang Jiang, Junyang Lin, Dayiheng Liu, Linfeng Zhang
Title: Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning
Abstract:
As supervised fine-tuning (SFT) evolves from a lightweight post-training step into a compute-intensive phase rivaling mid-training in scale, data efficiency has become critical for aligning large language models (LLMs) under tight budgets. Existing data pruning methods suffer from a fragmented design: they operate either at the sample level or the token level in isolation, failing to jointly optimize both dimensions. This disconnect leads to significant inefficiencies--high-value samples may still contain redundant tokens, while token-level pruning often discards crucial instructional or corrective signals embedded in individual examples. To address this bottleneck, we introduce the Error-Uncertainty (EU) Plane, a diagnostic framework that jointly characterizes the heterogeneous utility of training data across samples and tokens. Guided by this insight, we propose Quadrant-based Tuning (Q-Tuning), a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning employs a two-stage strategy: first, it performs sample-level triage to retain examples rich in informative misconceptions or calibration signals; second, it applies an asymmetric token-pruning policy, using a context-aware scoring mechanism to trim less salient tokens exclusively from misconception samples while preserving calibration samples in their entirety. Our method sets a new state of the art across five diverse benchmarks. Remarkably, on SmolLM2-1.7B, Q-Tuning achieves a +38\% average improvement over the full-data SFT baseline using only 12.5\% of the original training data. As the first dynamic pruning approach to consistently outperform full-data training, Q-Tuning provides a practical and scalable blueprint for maximizing data utilization in budget-constrained LLM SFT.
中文: 为解决大语言模型监督微调中数据剪枝的低效问题,我们提出了误差-不确定性平面和Q调优框架,通过协同样本与词元剪枝策略,在极低数据用量下实现了性能突破。
English: To address the inefficiencies in data pruning for supervised fine-tuning of large language models, we introduce the Error-Uncertainty Plane and Q-Tuning, a unified framework that strategically coordinates sample and token pruning, achieving state-of-the-art performance with significantly reduced data usage.

Authors:Pengxiang Li, Zechen Hu, Zirui Shang, Jingrong Wu, Yang Liu, Hui Liu, Zhi Gao, Chenrui Shi, Bofei Zhang, Zihao Zhang, Xiaochuan Shi, Zedong YU, Yuwei Wu, Xinxiao Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
Title: Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation
Abstract:
Vision-language model (VLM) based GUI agents show promise for automating complex desktop and mobile tasks, but face significant challenges in applying reinforcement learning (RL): (1) slow multi-turn interactions with GUI environments for policy rollout, and (2) insufficient high-quality agent-environment interactions for policy learning. To address these challenges, we propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. DART separates the training system into four asynchronous modules: environment cluster, rollout service, data manager, and trainer. This design enables non-blocking communication, asynchronous training, rollout-wise trajectory sampling, and per-worker model synchronization, significantly improving the system efficiency: 1.6*GPU utilization for rollout, 1.9* training throughput, and 5.5* environment utilization. To facilitate effective learning from abundant samples, we introduce an adaptive data curation scheme: (1) pre-collecting successful trajectories for challenging tasks to supplement sparse success in online sampling; (2) dynamically adjusting rollout numbers and trajectory lengths based on task difficulty; (3) training selectively on high-entropy steps to prioritize critical decisions; (4) stabilizing learning via truncated importance sampling for policy mismatch between policy rollout and updating. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model, and 7.34% higher than open-source SOTA. We will fully open-source our training framework, data, and model checkpoints via computer-use-agents.github.io/dart-gui, which we believe is a timely contribution to the open-source community of agentic RL training.
中文: DART提出了一种解耦的GUI智能体强化学习框架,通过异步训练和自适应数据管理有效解决了强化学习中的效率问题,显著提升了系统性能并在OSWorld基准测试中取得了领先成果。
English: DART introduces a decoupled agentic reinforcement learning framework for GUI agents, addressing RL challenges by enabling asynchronous training and adaptive data curation, which significantly boosts system efficiency and achieves state-of-the-art performance on the OSWorld benchmark.

Authors:Chao Wang, Rui-Chen Zheng, Yang Ai, Zhen-Hua Ling
Title: Understanding Textual Capability Degradation in Speech LLMs via Parameter Importance Analysis
Abstract:
The integration of speech into Large Language Models (LLMs) has substantially expanded their capabilities, but often at the cost of weakening their core textual competence. This degradation limits the ability of speech-enabled LLMs to fully exploit their pre-trained text-based knowledge. In this work, we analyze the underlying mechanisms of this issue through a focused study of the widely used encoder-adaptor paradigm. We propose an analytical framework based on parameter importance estimation, which reveals that fine-tuning for speech introduces a textual importance distribution shift: the layer-wise allocation of parameters critical to textual reasoning is disrupted. Building on this insight, we investigate two mitigation strategies: layer-wise learning rate scheduling and Low-Rank Adaptation (LoRA), both aim to preserve the original parameter distribution. Experimental results show that both approaches better maintain textual competence than full fine-tuning, while also improving downstream spoken question answering performance. Furthermore, our analysis offers a principled explanation for the effectiveness of the proposed mitigation strategies, linking their benefits to the structural properties of textual knowledge in LLMs.
中文: 语音集成到大型语言模型中常削弱其文本能力,但采用分层学习率或低秩适应微调可保持文本理解力,同时提升口语问答性能。
English: The integration of speech into LLMs often weakens their textual abilities, but using layer-wise learning rates or LoRA during fine-tuning helps preserve text competence while enhancing spoken question answering.

Authors:Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, David Lo
Title: SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios
Abstract:
Large language model (LLM) powered code agents are rapidly transforming software engineering by automating tasks such as testing, debugging, and repairing, yet the security risks of their generated code have become a critical concern. Existing benchmarks have offered valuable insights but remain insufficient: they often overlook the genuine context in which vulnerabilities were introduced or adopt narrow evaluation protocols that fail to capture either functional correctness or newly introduced vulnerabilities. We therefore introduce SecureAgentBench, a benchmark of 105 coding tasks designed to rigorously evaluate code agents' capabilities in secure code generation. Each task includes (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified introduction points, and (iii) comprehensive evaluation that combines functionality testing, vulnerability checking through proof-of-concept exploits, and detection of newly introduced vulnerabilities using static analysis. We evaluate three representative agents (SWE-agent, OpenHands, and Aider) with three state-of-the-art LLMs (Claude 3.7 Sonnet, GPT-4.1, and DeepSeek-V3.1). Results show that (i) current agents struggle to produce secure code, as even the best-performing one, SWE-agent supported by DeepSeek-V3.1, achieves merely 15.2% correct-and-secure solutions, (ii) some agents produce functionally correct code but still introduce vulnerabilities, including new ones not previously recorded, and (iii) adding explicit security instructions for agents does not significantly improve secure coding, underscoring the need for further research. These findings establish SecureAgentBench as a rigorous benchmark for secure code generation and a step toward more reliable software development with LLMs.
中文:大型语言模型驱动的代码代理正在推动软件工程发展,但也带来了严重的安全风险,为此我们推出了SecureAgentBench这一综合基准测试,结果表明现有代理即使能生成功能正确的代码,仍难以确保安全性。
English: Large language model-powered code agents are advancing software engineering but pose significant security risks, prompting the introduction of SecureAgentBench, a comprehensive benchmark that reveals current agents' struggles to produce secure code despite functional correctness.

Authors:Jianhan Wu, Xiaoyang Qu, Zhangcheng Huang, Jianzong Wang
Title: Federated Domain Generalization with Domain-specific Soft Prompts Generation
Abstract:
Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. engendering domain shift among clients and posing a formidable challenge for downstream-task adaptation. Existing federated domain generalization (FDG) methods based on prompt learning typically learn soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts exhibit limited diversity and tend to ignore information from unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, during training, we introduce domain-specific soft prompts (DSPs) for each domain and integrate content and domain knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Comprehensive evaluations across several public datasets confirm that our method outperforms existing strong baselines in FDG, achieving state-of-the-art results.
中文摘要:提示学习以少量参数高效地将CLIP适配至下游任务,而提出的FedDSPG方法通过生成领域特定软提示增强联邦领域泛化能力,在多个公开数据集上实现了最优性能。
English Summary: Prompt learning efficiently adapts CLIP to downstream tasks with minimal parameters, and the proposed FedDSPG method enhances federated domain generalization by generating domain-specific soft prompts, achieving state-of-the-art performance across multiple datasets.

Authors:Changze Lv, Yifei Wang, Yanxun Zhang, Yiyang Lu, Jingwen Xu, Di Yu, Xin Du, Xuanjing Huang, Xiaoqing Zheng
Title: Biologically Plausible Learning via Bidirectional Spike-Based Distillation
Abstract:
Developing biologically plausible learning algorithms that can achieve performance comparable to error backpropagation remains a longstanding challenge. Existing approaches often compromise biological plausibility by entirely avoiding the use of spikes for error propagation or relying on both positive and negative learning signals, while the question of how spikes can represent negative values remains unresolved. To address these limitations, we introduce Bidirectional Spike-based Distillation (BSD), a novel learning algorithm that jointly trains a feedforward and a backward spiking network. We formulate learning as a transformation between two spiking representations (i.e., stimulus encoding and concept encoding) so that the feedforward network implements perception and decision-making by mapping stimuli to actions, while the backward network supports memory recall by reconstructing stimuli from concept representations. Extensive experiments on diverse benchmarks, including image recognition, image generation, and sequential regression, show that BSD achieves performance comparable to networks trained with classical error backpropagation. These findings represent a significant step toward biologically grounded, spike-driven learning in neural networks.
中文摘要:本研究提出的双向脉冲蒸馏算法通过联合训练前向和反向脉冲网络,在保持生物合理性的同时实现了与误差反向传播相当的性能表现。
English Summary: This study introduces Bidirectional Spike-based Distillation (BSD), a biologically plausible learning algorithm using spiking networks that achieves performance comparable to error backpropagation across various benchmarks.

Authors:Yuchuan Mao, Zhi Gao, Xiaomeng Fan, Yuwei Wu, Yunde Jia, Chenchen Jing
Title: Adaptive Model Ensemble for Continual Learning
Abstract:
Model ensemble is an effective strategy in continual learning, which alleviates catastrophic forgetting by interpolating model parameters, achieving knowledge fusion learned from different tasks. However, existing model ensemble methods usually encounter the knowledge conflict issue at task and layer levels, causing compromised learning performance in both old and new tasks. To solve this issue, we propose meta-weight-ensembler that adaptively fuses knowledge of different tasks for continual learning. Concretely, we employ a mixing coefficient generator trained via meta-learning to generate appropriate mixing coefficients for model ensemble to address the task-level knowledge conflict. The mixing coefficient is individually generated for each layer to address the layer-level knowledge conflict. In this way, we learn the prior knowledge about adaptively accumulating knowledge of different tasks in a fused model, achieving efficient learning in both old and new tasks. Meta-weight-ensembler can be flexibly combined with existing continual learning methods to boost their ability of alleviating catastrophic forgetting. Experiments on multiple continual learning datasets show that meta-weight-ensembler effectively alleviates catastrophic forgetting and achieves state-of-the-art performance.
中文: 该研究提出的元权重集成器通过元学习自适应生成分层混合系数,解决了持续学习中任务级和层级的知识冲突问题,在有效缓解灾难性遗忘的同时实现了最先进的性能。
English: The proposed meta-weight-ensembler adaptively generates layer-specific mixing coefficients through meta-learning to resolve task-level and layer-level knowledge conflicts in model ensemble, effectively alleviating catastrophic forgetting while achieving state-of-the-art performance in continual learning.

Authors:Adarsh Salagame, Henry Noyes, Alireza Ramezani, Eric Sihite, Arash Kalantari
Title: Crater Observing Bio-inspired Rolling Articulator (COBRA)
Abstract:
NASA aims to establish a sustainable human basecamp on the Moon as a stepping stone for future missions to Mars and beyond. The discovery of water ice on the Moon's craters located in permanently shadowed regions, which can provide drinking water, oxygen, and rocket fuel, is therefore of critical importance. However, current methods to access lunar ice deposits are limited. While rovers have been used to explore the lunar surface for decades, they face significant challenges in navigating harsh terrains, such as permanently shadowed craters, due to the high risk of immobilization. This report introduces COBRA (Crater Observing Bio-inspired Rolling Articulator), a multi-modal snake-style robot designed to overcome mobility challenges in Shackleton Crater's rugged environment. COBRA combines slithering and tumbling locomotion to adapt to various crater terrains. In snake mode, it uses sidewinding to traverse flat or low inclined surfaces, while in tumbling mode, it forms a circular barrel by linking its head and tail, enabling rapid movement with minimal energy on steep slopes. Equipped with an onboard computer, stereo camera, inertial measurement unit, and joint encoders, COBRA facilitates real-time data collection and autonomous operation. This paper highlights COBRAs robustness and efficiency in navigating extreme terrains through both simulations and experimental validation.
中文: NASA计划建立可持续的月球基地以支持未来火星任务,而COBRA机器人作为一种多模式解决方案被提出,旨在克服在月球崎岖环形山中获取水冰资源时所面临的移动性挑战。
English: NASA plans to build a sustainable lunar basecamp to support future Mars missions, and the COBRA robot is introduced as a multi-modal solution to overcome mobility challenges in accessing water ice in the Moon's rugged craters.

Authors:Runyan Yang, Yuke Si, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
Title: Teaching Audio Models to Reason: A Unified Framework for Source- and Layer-wise Distillation
Abstract:
While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio models while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.
中文摘要:本文提出统一知识蒸馏框架,通过源级和层级蒸馏将文本教师模型的推理能力迁移至音频学生模型,有效弥合模态差距并显著提升音频推理性能。
English Summary: This paper introduces a unified knowledge distillation framework that transfers reasoning capabilities from a textual teacher model to audio student models through source-wise and layer-wise distillation, effectively bridging the modality gap and significantly improving audio reasoning performance.

Authors:Yuke Si, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
Title: HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling
Abstract:
Recent advances in large language models have facilitated the development of unified speech language models (SLMs) capable of supporting multiple speech tasks within a shared architecture. However, tasks such as automatic speech recognition (ASR) and speech emotion recognition (SER) rely on distinct types of information: ASR primarily depends on linguistic content, whereas SER requires the integration of both linguistic and paralinguistic cues. Existing multitask SLMs typically adopt naive parameter sharing or prompt-based conditioning without explicitly modeling the differences in information composition required by each task. Such designs risk task interference and performance degradation, especially under limited data conditions. To address these limitations, we propose HarmoniFuse, a component-selective and prompt-adaptive framework for multi-task speech language modeling. HarmoniFuse is designed to harmonize heterogeneous task demands by selecting and fusing task-relevant components of speech representations. Specifically, it integrates a gated speech encoder to extract task-specific acoustic features and a prompt-adaptive dynamic fusion module to aggregate transformer layers based on task characteristics. In addition, a batch-interleaved training strategy enables leveraging separate ASR and SER datasets without requiring joint annotation. Experimental results demonstrate that HarmoniFuse improves both ASR and SER performance, offering a scalable and robust solution for multitask speech understanding under realistic data constraints.
中文:提出的HarmoniFuse框架通过选择性融合任务特定声学特征和动态适配Transformer层,解决了语音语言模型中的多任务干扰问题,在数据受限条件下同时提升了语音识别和情感识别的性能。
English: The proposed HarmoniFuse framework addresses multitask interference in speech language models by selectively fusing task-specific acoustic features and dynamically adapting transformer layers, improving both speech recognition and emotion recognition performance under data constraints.

Authors:Yunhao Yang, Junyuan Hong, Gabriel Jacob Perin, Zhiwen Fan, Li Yin, Zhangyang Wang, Ufuk Topcu
Title: AD-VF: LLM-Automatic Differentiation Enables Fine-Tuning-Free Robot Planning from Formal Methods Feedback
Abstract:
Large language models (LLMs) can translate natural language instructions into executable action plans for robotics, autonomous driving, and other domains. Yet, deploying LLM-driven planning in the physical world demands strict adherence to safety and regulatory constraints, which current models often violate due to hallucination or weak alignment. Traditional data-driven alignment methods, such as Direct Preference Optimization (DPO), require costly human labeling, while recent formal-feedback approaches still depend on resource-intensive fine-tuning. In this paper, we propose LAD-VF, a fine-tuning-free framework that leverages formal verification feedback for automated prompt engineering. By introducing a formal-verification-informed text loss integrated with LLM-AutoDiff, LAD-VF iteratively refines prompts rather than model parameters. This yields three key benefits: (i) scalable adaptation without fine-tuning; (ii) compatibility with modular LLM architectures; and (iii) interpretable refinement via auditable prompts. Experiments in robot navigation and manipulation tasks demonstrate that LAD-VF substantially enhances specification compliance, improving success rates from 60% to over 90%. Our method thus presents a scalable and interpretable pathway toward trustworthy, formally-verified LLM-driven control systems.
中文摘要:大语言模型能将自然语言指令转化为可执行的动作计划,但常因幻觉或对齐不足而违反安全约束,因此提出的LAD-VF框架通过形式化验证反馈进行提示工程而非微调,将规范遵从成功率从60%显著提升至90%以上。
English Summary: Large language models can translate instructions into action plans but often violate safety constraints, so the proposed LAD-VF framework uses formal verification feedback for prompt engineering instead of fine-tuning to significantly improve compliance rates from 60% to over 90%.

Authors:Zhefan Wang, Ning Geng, Zhiqiang Guo, Weizhi Ma, Min Zhang
Title: Human vs. Agent in Task-Oriented Conversations
Abstract:
Task-oriented conversational systems are essential for efficiently addressing diverse user needs, yet their development requires substantial amounts of high-quality conversational data that is challenging and costly to obtain. While large language models (LLMs) have demonstrated potential in generating synthetic conversations, the extent to which these agent-generated interactions can effectively substitute real human conversations remains unclear. This work presents the first systematic comparison between LLM-simulated users and human users in personalized task-oriented conversations. We propose a comprehensive analytical framework encompassing three key aspects (conversation strategy, interaction style, and conversation evaluation) and ten distinct dimensions for evaluating user behaviors, and collect parallel conversational datasets from both human users and LLM agent users across four representative scenarios under identical conditions. Our analysis reveals significant behavioral differences between the two user types in problem-solving approaches, question broadness, user engagement, context dependency, feedback polarity and promise, language style, and hallucination awareness. We found consistency in the agent users and human users across the depth-first or breadth-first dimensions, as well as the usefulness dimensions. These findings provide critical insights for advancing LLM-based user simulation. Our multi-dimensional taxonomy constructed a generalizable framework for analyzing user behavior patterns, offering insights from LLM agent users and human users. By this work, we provide perspectives on rethinking how to use user simulation in conversational systems in the future.
中文摘要:本研究首次系统比较了任务导向对话中LLM模拟用户与真实用户的行为差异,发现在八个维度存在显著区别,但在问题解决路径和实用性评估方面表现一致,为改进基于大语言模型的用户模拟提供了重要参考。
English Summary: This study systematically compares LLM-simulated users with human users in task-oriented conversations, revealing significant behavioral differences across eight dimensions while identifying consistency in problem-solving approaches and usefulness assessments.

Authors:Rinka Nobukawa, Makito Kitamura, Tomohiko Nakamura, Shinnosuke Takamichi, Hiroshi Saruwatari
Title: Drum-to-Vocal Percussion Sound Conversion and Its Evaluation Methodology
Abstract:
This paper defines the novel task of drum-to-vocal percussion (VP) sound conversion. VP imitates percussion instruments through human vocalization and is frequently employed in contemporary a cappella music. It exhibits acoustic properties distinct from speech and singing (e.g., aperiodicity, noisy transients, and the absence of linguistic structure), making conventional speech or singing synthesis methods unsuitable. We thus formulate VP synthesis as a timbre transfer problem from drum sounds, leveraging their rhythmic and timbral correspondence. To support this formulation, we define three requirements for successful conversion: rhythmic fidelity, timbral consistency, and naturalness as VP. We also propose corresponding subjective evaluation criteria. We implement two baseline conversion methods using a neural audio synthesizer, the real-time audio variational autoencoder (RAVE), with and without vector quantization (VQ). Subjective experiments show that both methods produce plausible VP outputs, with the VQ-based RAVE model yielding more consistent conversion.
中文: 本文提出将鼓声转换为人声打击乐作为音色迁移任务,并采用RAVE模型实现基准方法,其中基于矢量量化的模型展现出更优的转换一致性。
English: This paper introduces drum-to-vocal percussion conversion as a timbre transfer task and proposes baseline methods using RAVE models, with the VQ-enhanced version demonstrating superior consistency.

Authors:Pengcheng Li, Botao Zhao, Zuheng Kang, Junqing Peng, Xiaoyang Qu, Yayun He, Jianzong Wang
Title: EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition
Abstract:
Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent advances in Reinforcement Learning (RL) have shown promise in improving LALMs' reasoning abilities. However, two critical challenges hinder the direct application of RL techniques to Speech Emotion Recognition (SER) tasks: (1) convergence instability caused by ambiguous emotional boundaries and (2) limited reasoning ability when using relatively small models (e.g., 7B-parameter architectures). To overcome these limitations, we introduce EMO-RL, a novel framework incorporating reinforcement learning with two key innovations: Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). Built upon pretrained LALMs, our method employs group-relative policy optimization with emotion constraints. Comprehensive experiments demonstrate that our EMO-RL training strategies can significantly enhance the emotional reasoning capabilities of LALMs, attaining state-of-the-art results on both the MELD and IEMOCAP datasets, and cross-dataset experiments prove the strong superiority of generalization.
Chinese: 尽管大型音频语言模型在情感识别与推理方面表现欠佳,但通过引入EMO-RL强化学习框架(包含情感相似性加权奖励和显式结构化推理机制),该模型在MELD和IEMOCAP数据集上实现了最优性能并展现出卓越的泛化能力。
English: Large Audio-Language Models (LALMs) still struggle with emotion recognition and reasoning, but the new EMO-RL framework, featuring Emotion Similarity-Weighted Reward and Explicit Structured Reasoning, significantly boosts their emotional reasoning capabilities and achieves top performance on benchmark datasets.

Authors:Tahar Chettaoui, Naser Damer, Fadi Boutros
Title: Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications
Abstract:
Foundation models such as CLIP have demonstrated exceptional zero- and few-shot transfer capabilities across diverse vision tasks. However, when fine-tuned for highly specialized biometric tasks, face recognition (FR), morphing attack detection (MAD), and presentation attack detection (PAD), these models may suffer from over-specialization. Thus, they may lose one of their foundational strengths, cross-domain generalization. In this work, we systematically quantify these trade-offs by evaluating three instances of CLIP fine-tuned for FR, MAD, and PAD. We evaluate each adapted model as well as the original CLIP baseline on 14 general vision datasets under zero-shot and linear-probe protocols, alongside common FR, MAD, and PAD benchmarks. Our results indicate that fine-tuned models suffer from over-specialization, especially when fine-tuned for complex tasks of FR. Also, our results pointed out that task complexity and classification head design, multi-class (FR) vs. binary (MAD and PAD), correlate with the degree of catastrophic forgetting. The FRoundation model with the ViT-L backbone outperforms other approaches on the large-scale FR benchmark IJB-C, achieving an improvement of up to 58.52%. However, it experiences a substantial performance drop on ImageNetV2, reaching only 51.63% compared to 69.84% achieved by the baseline CLIP model. Moreover, the larger CLIP architecture consistently preserves more of the model's original generalization ability than the smaller variant, indicating that increased model capacity may help mitigate over-specialization.
中文: 将CLIP微调用于面部识别等专业生物识别任务会导致过度专业化,削弱其跨领域泛化能力,但更大的模型架构有助于保留更多原始性能。
English: Fine-tuning CLIP for specialized biometric tasks like face recognition leads to over-specialization, reducing its cross-domain generalization, but larger model architectures can help preserve more of the original capabilities.

Authors:Kentaro Seki, Yuki Okamoto, Kouei Yamaoka, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari
Title: Spatial-CLAP: Learning Spatially-Aware audio--text Embeddings for Multi-Source Conditions
Abstract:
Contrastive language--audio pretraining (CLAP) has achieved remarkable success as an audio--text embedding framework, but existing approaches are limited to monaural or single-source conditions and cannot fully capture spatial information. The central challenge in modeling spatial information lies in multi-source conditions, where the correct correspondence between each sound source and its location is required. To tackle this problem, we propose Spatial-CLAP, which introduces a content-aware spatial encoder that enables spatial representations coupled with audio content. We further propose spatial contrastive learning (SCL), a training strategy that explicitly enforces the learning of the correct correspondence and promotes more reliable embeddings under multi-source conditions. Experimental evaluations, including downstream tasks, demonstrate that Spatial-CLAP learns effective embeddings even under multi-source conditions, and confirm the effectiveness of SCL. Moreover, evaluation on unseen three-source mixtures highlights the fundamental distinction between conventional single-source training and our proposed multi-source training paradigm. These findings establish a new paradigm for spatially-aware audio--text embeddings.
Chinese: Spatial-CLAP通过引入内容感知的空间编码器和空间对比学习,有效解决了多源音频条件下的空间信息建模问题,为空间感知的音频-文本嵌入建立了新范式。
English: Spatial-CLAP introduces a content-aware spatial encoder and spatial contrastive learning to effectively capture spatial information in multi-source audio conditions, establishing a new paradigm for spatially-aware audio-text embeddings.

Authors:Ye-Xin Lu, Yu Gu, Kun Wei, Hui-Peng Du, Yang Ai, Zhen-Hua Ling
Title: DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis
Abstract:
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background environment of the synthesized speech. Built upon F5-TTS, the proposed DAIEN-TTS first incorporates a pretrained speech-environment separation (SES) module to disentangle the environmental speech into mel-spectrograms of clean speech and environment audio. Two random span masks of varying lengths are then applied to both mel-spectrograms, which, together with the text embedding, serve as conditions for infilling the masked environmental mel-spectrogram, enabling the simultaneous continuation of personalized speech and time-varying environmental audio. To further enhance controllability during inference, we adopt dual class-free guidance (DCFG) for the speech and environment components and introduce a signal-to-noise ratio (SNR) adaptation strategy to align the synthesized speech with the environment prompt. Experimental results demonstrate that DAIEN-TTS generates environmental personalized speech with high naturalness, strong speaker similarity, and high environmental fidelity.
中文: DAIEN-TTS是一种零样本文本转语音框架,通过解缠音频填充技术独立控制说话人音色和背景环境,实现了高度自然、说话人相似且环境保真度高的语音合成。
English: DAIEN-TTS is a zero-shot text-to-speech framework that independently controls speaker timbre and background environment through disentangled audio infilling, achieving highly natural, speaker-similar, and environmentally faithful speech synthesis.

Authors:Fei Liu, Yang Ai, Zhen-Hua Ling
Title: Neural Speech Separation with Parallel Amplitude and Phase Spectrum Estimation
Abstract:
This paper proposes APSS, a novel neural speech separation model with parallel amplitude and phase spectrum estimation. Unlike most existing speech separation methods, the APSS distinguishes itself by explicitly estimating the phase spectrum for more complete and accurate separation. Specifically, APSS first extracts the amplitude and phase spectra from the mixed speech signal. Subsequently, the extracted amplitude and phase spectra are fused by a feature combiner into joint representations, which are then further processed by a deep processor with time-frequency Transformers to capture temporal and spectral dependencies. Finally, leveraging parallel amplitude and phase separators, the APSS estimates the respective spectra for each speaker from the resulting features, which are then combined via inverse short-time Fourier transform (iSTFT) to reconstruct the separated speech signals. Experimental results indicate that APSS surpasses both time-domain separation methods and implicit-phase-estimation-based time-frequency approaches. Also, APSS achieves stable and competitive results on multiple datasets, highlighting its strong generalization capability and practical applicability.
中文: 本文提出APSS新型神经语音分离模型,通过并行估计振幅谱与相位谱实现更精准的分离效果,在多数据集上展现出卓越的泛化能力,性能优于传统时域分离和隐式相位估计方法。
English: This paper introduces APSS, a novel neural speech separation model that uniquely estimates both amplitude and phase spectra in parallel, outperforming existing methods by achieving more accurate separation and demonstrating strong generalization across multiple datasets.

Authors:Lin Zhu, Lingwei Kong, Xin Ning, Xiaoyang Qu, Jianzong Wang
Title: Publicly Verifiable Private Information Retrieval Protocols Based on Function Secret Sharing
Abstract:
Private Information Retrieval (PIR) is a fundamental cryptographic primitive that enables users to retrieve data from a database without revealing which item is being accessed, thereby preserving query privacy. However, PIR protocols also face the challenge of result verifiability, as users expect the reconstructed data to be trustworthy and authentic. In this work, we propose two effective constructions of publicly verifiable PIR (PVPIR) in the multi-server setting, which achieve query privacy, correctness, and verifiability simultaneously. We further present three concrete instantiations based on these constructions. For the point query, our protocol introduces minimal computational overhead and achieves strong verifiability guarantees with significantly lower communication costs compared to existing Merkle tree-based approaches. For the predicate query, the communication complexity of our scheme remains stable as the database size increases, demonstrating strong scalability and suitability for large-scale private query applications.
Chinese: 本研究提出了两种多服务器环境下的公开可验证私有信息检索(PVPIR)方案,在实现查询隐私性、正确性和可验证性的同时,点查询计算开销极小且谓词查询具有稳定的通信复杂度。
English: This work introduces two publicly verifiable private information retrieval (PVPIR) constructions for multi-server settings, offering query privacy, correctness, and verifiability with minimal overhead for point queries and scalable communication for predicate queries.

Authors:Luís F. Gomes, Xin Zhou, David Lo, Rui Abreu
Title: VisDocSketcher: Towards Scalable Visual Documentation with Agentic Systems
Abstract:
Visual documentation is an effective tool for reducing the cognitive barrier developers face when understanding unfamiliar code, enabling more intuitive comprehension. Compared to textual documentation, it provides a higher-level understanding of the system structure and data flow. Developers usually prefer visual representations over lengthy textual descriptions for large software systems. Visual documentation is both difficult to produce and challenging to evaluate. Manually creating it is time-consuming, and currently, no existing approach can automatically generate high-level visual documentation directly from code. Its evaluation is often subjective, making it difficult to standardize and automate. To address these challenges, this paper presents the first exploration of using agentic LLM systems to automatically generate visual documentation. We introduce VisDocSketcher, the first agent-based approach that combines static analysis with LLM agents to identify key elements in the code and produce corresponding visual representations. We propose a novel evaluation framework, AutoSketchEval, for assessing the quality of generated visual documentation using code-level metrics. The experimental results show that our approach can valid visual documentation for 74.4% of the samples. It shows an improvement of 26.7-39.8% over a simple template-based baseline. Our evaluation framework can reliably distinguish high-quality (code-aligned) visual documentation from low-quality (non-aligned) ones, achieving an AUC exceeding 0.87. Our work lays the foundation for future research on automated visual documentation by introducing practical tools that not only generate valid visual representations but also reliably assess their quality.
中文: 本文提出了首个基于智能体与LLM的视觉文档自动生成系统VisDocSketcher,并开发了AutoSketchEval评估框架,能可靠地区分高质量与低质量的可视化文档,为自动化视觉文档研究奠定了基础。
English: This paper introduces VisDocSketcher, the first agent-based system using LLMs to automatically generate visual documentation from code, and proposes AutoSketchEval, a novel evaluation framework that effectively assesses documentation quality with high reliability.

Authors:Jonas Kühne, Christian Vogt, Michele Magno, Luca Benini
Title: Efficient and Accurate Downfacing Visual Inertial Odometry
Abstract:
Visual Inertial Odometry (VIO) is a widely used computer vision method that determines an agent's movement through a camera and an IMU sensor. This paper presents an efficient and accurate VIO pipeline optimized for applications on micro- and nano-UAVs. The proposed design incorporates state-of-the-art feature detection and tracking methods (SuperPoint, PX4FLOW, ORB), all optimized and quantized for emerging RISC-V-based ultra-low-power parallel systems on chips (SoCs). Furthermore, by employing a rigid body motion model, the pipeline reduces estimation errors and achieves improved accuracy in planar motion scenarios. The pipeline's suitability for real-time VIO is assessed on an ultra-low-power SoC in terms of compute requirements and tracking accuracy after quantization. The pipeline, including the three feature tracking methods, was implemented on the SoC for real-world validation. This design bridges the gap between high-accuracy VIO pipelines that are traditionally run on computationally powerful systems and lightweight implementations suitable for microcontrollers. The optimized pipeline on the GAP9 low-power SoC demonstrates an average reduction in RMSE of up to a factor of 3.65x over the baseline pipeline when using the ORB feature tracker. The analysis of the computational complexity of the feature trackers further shows that PX4FLOW achieves on-par tracking accuracy with ORB at a lower runtime for movement speeds below 24 pixels/frame.
本文提出了一种针对微型无人机的优化视觉惯性里程计流程,集成了针对RISC-V芯片量化的先进特征跟踪方法,在保持精度的同时显著提升了运行效率。
This paper introduces an optimized Visual Inertial Odometry pipeline for micro-UAVs, integrating advanced feature tracking methods quantized for RISC-V SoCs to enhance accuracy and efficiency in real-time applications.

Authors:Zhuoyuan Li, Jiacheng Li, Yao Li, Jialin Li, Li Li, Dong Liu, Feng Wu
Title: In-Loop Filtering Using Learned Look-Up Tables for Video Coding
Abstract:
In-loop filtering (ILF) is a key technology in video coding standards to reduce artifacts and enhance visual quality. Recently, neural network-based ILF schemes have achieved remarkable coding gains, emerging as a powerful candidate for next-generation video coding standards. However, the use of deep neural networks (DNN) brings significant computational and time complexity or high demands for dedicated hardware, making it challenging for general use. To address this limitation, we study a practical ILF solution by adopting look-up tables (LUTs). After training a DNN with a restricted reference range for ILF, all possible inputs are traversed, and the output values of the DNN are cached into LUTs. During the coding process, the filtering process is performed by simply retrieving the filtered pixel through locating the input pixels and interpolating between the cached values, instead of relying on heavy inference computations. In this paper, we propose a universal LUT-based ILF framework, termed LUT-ILF++. First, we introduce the cooperation of multiple kinds of filtering LUTs and propose a series of customized indexing mechanisms to enable better filtering reference perception with limited storage consumption. Second, we propose the cross-component indexing mechanism to enable the filtering of different color components jointly. Third, in order to make our solution practical for coding uses, we propose the LUT compaction scheme to enable the LUT pruning, achieving a lower storage cost of the entire solution. The proposed framework is implemented in the VVC reference software. Experimental results show that the proposed framework achieves on average 0.82%/2.97%/1.63% and 0.85%/4.11%/2.06% bitrate reduction for common test sequences, under the AI and RA configurations, respectively. Compared to DNN-based solutions, our proposed solution has much lower time complexity and storage cost.
中文: 本研究提出了LUT-ILF++框架,通过查找表替代计算密集型深度神经网络实现环路滤波,在显著降低码率的同时大幅减少了计算复杂度和存储需求。
English: This study introduces LUT-ILF++, a practical in-loop filtering framework using look-up tables to replace computationally intensive deep neural networks, achieving significant bitrate reduction with lower complexity and storage costs.

Authors:Xiaoxue Luo, Jinwei Huang, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang
Title: DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners
Abstract:
Universal audio codecs learn entangled representations across audio types, whereas some specific codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds, and downstream tasks require selective access to these components. Therefore, we rethink the audio codec as a universal disentangled representation learner to enable controllable feature selection across different audio tasks. To this end, we introduce DeCodec, a novel neural codec that learns to decouple audio representations into orthogonal subspaces dedicated to speech and background sound, and within speech, representations are further decomposed into semantic and paralinguistic components. This hierarchical disentanglement allows flexible feature selection, making DeCodec a universal front-end for multiple audio applications. Technically, built upon a codec framework, DeCodec incorporates two key innovations: a subspace orthogonal projection module that factorizes the input into two decoupled orthogonal subspaces, and a representation swap training procedure that ensures these two subspaces are correlate to the speech and background sound, respectively. These allows parallel RVQs to quantize speech and background sound components independently. Furthermore, we employ semantic guidance to the speech RVQ to achieve semantic and paralinguistic decomposition. Experimental results show that DeCodec maintains advanced signal reconstruction while enabling new capabilities: superior speech enhancement and effective one-shot voice conversion on noisy speech via representation recombination, improved ASR robustness through clean semantic features, and controllable background sound preservation/suppression in TTS. Demo Page: https://luo404.github.io/DeCodecV2/
中文摘要:DeCodec是一种新型神经音频编解码器,通过层次化解耦将音频分离为语音和背景声的正交子空间,并在语音内部进一步分解为语义和副语言成分,从而为多种音频应用提供灵活的特征选择能力。
English Summary: DeCodec is a novel neural audio codec that hierarchically disentangles audio into orthogonal subspaces for speech and background sounds, with speech further decomposed into semantic and paralinguistic components, enabling flexible feature selection for various audio applications.

Authors:Zongzheng Zhang, Chenghao Yue, Haobo Xu, Minwen Liao, Xianglin Qi, Huan-ang Gao, Ziwei Wang, Hao Zhao
Title: RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation
Abstract:
Robotic chemists promise to both liberate human experts from repetitive tasks and accelerate scientific discovery, yet remain in their infancy. Chemical experiments involve long-horizon procedures over hazardous and deformable substances, where success requires not only task completion but also strict compliance with experimental norms. To address these challenges, we propose \textit{RoboChemist}, a dual-loop framework that integrates Vision-Language Models (VLMs) with Vision-Language-Action (VLA) models. Unlike prior VLM-based systems (e.g., VoxPoser, ReKep) that rely on depth perception and struggle with transparent labware, and existing VLA systems (e.g., RDT, pi0) that lack semantic-level feedback for complex tasks, our method leverages a VLM to serve as (1) a planner to decompose tasks into primitive actions, (2) a visual prompt generator to guide VLA models, and (3) a monitor to assess task success and regulatory compliance. Notably, we introduce a VLA interface that accepts image-based visual targets from the VLM, enabling precise, goal-conditioned control. Our system successfully executes both primitive actions and complete multi-step chemistry protocols. Results show 23.57% higher average success rate and a 0.298 average increase in compliance rate over state-of-the-art VLA baselines, while also demonstrating strong generalization to objects and tasks.
中文摘要:RoboChemist是一个结合视觉语言模型与视觉语言动作模型的双循环框架,在执行化学实验时比现有系统表现出更高的成功率和实验规范遵循度。
English Summary: RoboChemist is a dual-loop framework combining Vision-Language Models and Vision-Language-Action models that outperforms existing systems in executing chemical experiments with higher success rates and better compliance with experimental norms.

Authors:Zongzheng Zhang, Haobo Xu, Zhuo Yang, Chenghao Yue, Zehao Lin, Huan-ang Gao, Ziwei Wang, Hao Zhao
Title: TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models
Abstract:
Many robotic manipulation tasks require sensing and responding to force signals such as torque to assess whether the task has been successfully completed and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore Torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder.Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings.
中文: 本研究提出扭矩感知的视觉-语言-动作模型,通过优化扭矩适配器位置和辅助预测策略,有效提升机器人在接触密集型操作任务中的性能表现。
English: This study introduces Torque-aware Vision-Language-Action models that enhance robotic manipulation by incorporating torque signals through optimized adapter placement and auxiliary prediction, significantly improving performance in contact-rich tasks.

Authors:Xiyu Guo, Shan Wang, Chunfang Ji, Xuefeng Zhao, Wenhao Xi, Yaoyao Liu, Qinglan Li, Chao Deng, Junlan Feng
Title: Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference
Abstract:
The rapid advancement of large language models (LLMs) and domain-specific AI agents has greatly expanded the ecosystem of AI-powered services. User queries, however, are highly diverse and often span multiple domains and task types, resulting in a complex and heterogeneous landscape. This diversity presents a fundamental routing challenge: how to accurately direct each query to an appropriate execution unit while optimizing both performance and efficiency. To address this, we propose MoMA (Mixture of Models and Agents), a generalized routing framework that integrates both LLM and agent-based routing. Built upon a deep understanding of model and agent capabilities, MoMA effectively handles diverse queries through precise intent recognition and adaptive routing strategies, achieving an optimal balance between efficiency and cost. Specifically, we construct a detailed training dataset to profile the capabilities of various LLMs under different routing model structures, identifying the most suitable tasks for each LLM. During inference, queries are dynamically routed to the LLM with the best cost-performance efficiency. We also introduce an efficient agent selection strategy based on a context-aware state machine and dynamic masking. Experimental results demonstrate that the MoMA router offers superior cost-efficiency and scalability compared to existing approaches.
中文: MoMA是一个通用路由框架,通过整合大语言模型和智能体技术,将多样化用户查询精准路由至最佳执行单元,实现了成本效益与可扩展性的最优平衡。
English: MoMA is a generalized routing framework that uses LLMs and agents to efficiently direct diverse user queries to the most suitable execution units, achieving optimal cost-performance balance and scalability.

Authors:Nico Krull, Lukas Schulthess, Michele Magno, Luca Benini, Christoph Leitner
Title: Wireless Low-Latency Synchronization for Body-Worn Multi-Node Systems in Sports
Abstract:
Biomechanical data acquisition in sports demands sub-millisecond synchronization across distributed body-worn sensor nodes. This study evaluates and characterizes the Enhanced ShockBurst (ESB) protocol from Nordic Semiconductor under controlled laboratory conditions for wireless, low-latency command broadcasting, enabling fast event updates in multi-node systems. Through systematic profiling of protocol parameters, including cyclic-redundancy-check modes, bitrate, transmission modes, and payload handling, we achieve a mean Device-to-Device (D2D) latency of 504.99 +- 96.89 us and a network-to-network core latency of 311.78 +- 96.90 us using a one-byte payload with retransmission optimization. This performance significantly outperforms Bluetooth Low Energy (BLE), which is constrained by a 7.5 ms connection interval, by providing deterministic, sub-millisecond synchronization suitable for high-frequency (500 Hz to 1000 Hz) biosignals. These results position ESB as a viable solution for time-critical, multi-node wearable systems in sports, enabling precise event alignment and reliable high-speed data fusion for advanced athlete monitoring and feedback applications.
中文: 本研究证实增强型ShockBurst协议能为分布式传感器网络实现亚毫秒级同步,其确定性延迟特性显著优于蓝牙低功耗技术,适用于体育应用中高频生物信号的精准监测。
English: This study demonstrates that the Enhanced ShockBurst protocol achieves sub-millisecond synchronization for distributed sensor networks, significantly outperforming Bluetooth Low Energy with deterministic latency suitable for high-frequency biosignal monitoring in sports applications.

Authors:Yihong Leng, Siming Zheng, Jinwei Chen, Bo Li, Jiaojiao Li, Peng-Tao Jiang
Title: RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentangled Representation
Abstract:
Event cameras provide sparse yet temporally high-temporal-resolution motion information, demonstrating great potential for motion deblurring. Existing methods focus on cross-modal interaction, overlooking the inherent incompleteness of event streams, which arises from the trade-off between sensitivity and noise introduced by the thresholding mechanism of Dynamic Vision Sensors (DVS). Such degradation compromises the integrity of motion priors and limits the effectiveness of event-guided deblurring. To tackle these challenges, we propose a Robust Event-guided Deblurring (RED) network with modality-specific disentangled representation. First, we introduce a Robustness-Oriented Perturbation Strategy (RPS) that applies random masking to events, which exposes RED to incomplete patterns and then foster robustness against various unknown scenario conditions.Next, a disentangled OmniAttention is presented to explicitly model intra-motion, inter-motion, and cross-modality correlations from two inherently distinct but complementary sources: blurry images and partially disrupted events. Building on these reliable features, two interactive modules are designed to enhance motion-sensitive areas in blurry images and inject semantic context into incomplete event representations. Extensive experiments on synthetic and real-world datasets demonstrate RED consistently achieves state-of-the-art performance in both accuracy and robustness.
中文: 提出的鲁棒事件引导去模糊(RED)网络通过扰动策略和模态特定表征,解决了事件相机中的噪声和欠报告问题,在去模糊性能和鲁棒性方面均达到了最优水平。
English: The proposed Robust Event-guided Deblurring (RED) network addresses noise and under-reporting issues in event cameras through a perturbation strategy and modality-specific representation, achieving state-of-the-art deblurring performance with enhanced robustness.

Authors:Yihong Leng, Siming Zheng, Jinwei Chen, Bo Li, Jiaojiao Li, Peng-Tao Jiang
Title: RED: Robust Event-Guided Motion Deblurring with Modality-Specific Disentangled Representation
Abstract:
Event cameras provide sparse yet temporally high-resolution motion information, demonstrating great potential for motion deblurring. However, the delicate events are highly susceptible to noise. Although noise can be reduced by raising the threshold of Dynamic Vision Sensors (DVS), this inevitably causes under-reporting of events. Most existing event-guided deblurring methods overlook this practical trade-off, and the indiscriminate feature extraction and naive fusion result in unstable and mixed representations and ultimately unsatisfactory performance. To tackle these challenges, we propose a Robust Event-guided Deblurring (RED) network with modality-specific disentangled representation. First, we introduce a Robustness-Oriented Perturbation Strategy (RPS) that mimics various DVS thresholds, exposing RED to diverse under-reporting patterns and thereby fostering robustness under unknown conditions. With an adaption to RPS, a Modality-specific Representation Mechanism (MRM) is designed to explicitly model semantic understanding, motion priors, and cross-modality correlations from two inherently distinct but complementary sources: blurry images and partially disrupted events. Building on these reliable features, two interactive modules are presented to enhance motion-sensitive areas in blurry images and inject semantic context into under-reporting event representations. Extensive experiments on synthetic and real-world datasets demonstrate RED consistently achieves state-of-the-art performance in terms of both accuracy and robustness.
中文: 提出的鲁棒事件引导去模糊(RED)网络通过扰动策略和模态特定表征,解决了事件相机中的噪声和欠报告问题,在去模糊性能和鲁棒性方面均达到了最优水平。
English: The proposed Robust Event-guided Deblurring (RED) network addresses noise and under-reporting issues in event cameras through a perturbation strategy and modality-specific representation, achieving state-of-the-art deblurring performance with enhanced robustness.

Authors:Quan Chen, Chenrui Shi, Qi Chen, Yuwei Wu, Zhi Gao, Xintong Zhang, Rui Gao, Kun Wu, Yunde Jia
Title: Long-Horizon Visual Imitation Learning via Plan and Code Reflection
Abstract:
Learning from long-horizon demonstrations with complex action sequences presents significant challenges for visual imitation learning, particularly in understanding temporal relationships of actions and spatial relationships between objects. In this paper, we propose a new agent framework that incorporates two dedicated reflection modules to enhance both plan and code generation. The plan generation module produces an initial action sequence, which is then verified by the plan reflection module to ensure temporal coherence and spatial alignment with the demonstration video. The code generation module translates the plan into executable code, while the code reflection module verifies and refines the generated code to ensure correctness and consistency with the generated plan. These two reflection modules jointly enable the agent to detect and correct errors in both the plan generation and code generation, improving performance in tasks with intricate temporal and spatial dependencies. To support systematic evaluation, we introduce LongVILBench, a benchmark comprising 300 human demonstrations with action sequences of up to 18 steps. LongVILBench emphasizes temporal and spatial complexity across multiple task types. Experimental results demonstrate that existing methods perform poorly on this benchmark, whereas our new framework establishes a strong baseline for long-horizon visual imitation learning.
中文摘要:本文提出一种新型智能体框架,通过双重反思模块分别优化计划与代码生成,有效解决长时序视觉模仿学习中的时空关系难题,并在LongVILBench基准测试中确立了优势性能。
English Summary: This paper introduces a novel agent framework with dual reflection modules for plan and code generation to address challenges in long-horizon visual imitation learning, validated by a new benchmark called LongVILBench where it establishes a strong baseline.

Authors:Quan Chen, Chenrui Shi, Qi Chen, Yuwei Wu, Zhi Gao, Xintong Zhang, Rui Gao, Kun Wu, Yunde Jia
Title: Long-Horizon Visual Imitation Learning via Plan and Code Reflection
Abstract:
Learning from long-horizon demonstrations with complex action sequences presents significant challenges for visual imitation learning, particularly in understanding temporal relationships of actions and spatial relationships between objects. In this paper, we propose a new agent framework that incorporates two dedicated reflection modules to enhance both plan and code generation. The plan generation module produces an initial action sequence, which is then verified by the plan reflection module to ensure temporal coherence and spatial alignment with the demonstration video. The code generation module translates the plan into executable code, while the code reflection module verifies and refines the generated code to ensure correctness and consistency with the generated plan. These two reflection modules jointly enable the agent to detect and correct errors in both the plan generation and code generation, improving performance in tasks with intricate temporal and spatial dependencies. To support systematic evaluation, we introduce LongVILBench, a benchmark comprising 300 human demonstrations with action sequences of up to 18 steps. LongVILBench emphasizes temporal and spatial complexity across multiple task types. Experimental results demonstrate that existing methods perform poorly on this benchmark, whereas our new framework establishes a strong baseline for long-horizon visual imitation learning.
中文摘要:本文提出一种新型智能体框架,通过双重反思模块分别优化计划与代码生成,有效解决长时序视觉模仿学习中的时空关系难题,并在LongVILBench基准测试中确立了优势性能。
English Summary: This paper introduces a novel agent framework with dual reflection modules for plan and code generation to address challenges in long-horizon visual imitation learning, validated by a new benchmark called LongVILBench where it establishes a strong baseline.

Authors:Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
Title: Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
Abstract:
We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieved competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
中文摘要:Entropy2Vec是一种创新框架,利用单语语言模型的熵来生成跨语言表征,通过结构相似性捕捉类型学关系,并在多语言自然语言处理任务中展现出优异性能。
English Summary: Entropy2Vec is a novel framework that uses the entropy of monolingual language models to create cross-lingual representations, capturing typological relationships through structural similarity and achieving competitive performance in multilingual NLP tasks.

Authors:Shuzhou Yang, Xiaoyu Li, Xiaodong Cun, Guangzhi Wang, Lingen Li, Ying Shan, Jian Zhang
Title: GenCompositor: Generative Video Compositing with Diffusion Transformer
Abstract:
Video compositing combines live-action footage to create video production, serving as a crucial technique in video creation and film production. Traditional pipelines require intensive labor efforts and expert collaboration, resulting in lengthy production cycles and high manpower costs. To address this issue, we automate this process with generative models, called generative video compositing. This new task strives to adaptively inject identity and motion information of foreground video to the target video in an interactive manner, allowing users to customize the size, motion trajectory, and other attributes of the dynamic elements added in final video. Specifically, we designed a novel Diffusion Transformer (DiT) pipeline based on its intrinsic properties. To maintain consistency of the target video before and after editing, we revised a light-weight DiT-based background preservation branch with masked token injection. As to inherit dynamic elements from other sources, a DiT fusion block is proposed using full self-attention, along with a simple yet effective foreground augmentation for training. Besides, for fusing background and foreground videos with different layouts based on user control, we developed a novel position embedding, named Extended Rotary Position Embedding (ERoPE). Finally, we curated a dataset comprising 61K sets of videos for our new task, called VideoComp. This data includes complete dynamic elements and high-quality target videos. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.
中文: 生成式视频合成通过创新的扩散变换器流程,自动化地将前景视频的身份与运动信息自适应注入目标视频,在保持背景一致性的同时允许用户自定义动态元素属性,显著提升了保真度与连贯性。
English: Generative video compositing automates the traditional labor-intensive process by using a novel Diffusion Transformer pipeline to adaptively inject identity and motion from foreground videos into target videos while maintaining background consistency through user-customizable controls.

Authors:Joonyong Park, Shinnosuke Takamichi, David M. Chan, Shunsuke Kando, Yuki Saito, Hiroshi Saruwatari
Title: Analysing the Language of Neural Audio Codecs
Abstract:
This study presents a comparative analysis of the statistical and linguistic properties of neural audio codecs (NACs). We investigate discrete speech tokens produced by various NAC models, examining their adherence to linguistic statistical laws such as Zipf's law and Heaps' law, as well as their entropy and redundancy. To assess how these token-level properties relate to semantic and acoustic preservation in synthesized speech, we evaluate intelligibility using error rates of automatic speech recognition, and quality using the UTMOS score. Our results reveal that NAC tokens, particularly 3-grams, exhibit language-like statistical patterns. Moreover, these properties, together with measures of information content, are found to correlate with improved performances in speech recognition and resynthesis tasks. These findings offer insights into the structure of NAC token sequences and inform the design of more effective generative speech models.
本研究比较了神经音频编解码器,通过分析其语音标记的统计和语言模式,发现这些特性与更优的语音识别和合成性能相关。
This study compares neural audio codecs by analyzing their speech tokens' statistical and linguistic patterns, finding that these properties correlate with better speech recognition and synthesis performance.

Authors:Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski, Luisa Bentivogli
Title: The Unheard Alternative: Contrastive Explanations for Speech-to-Text Models
Abstract:
Contrastive explanations, which indicate why an AI system produced one output (the target) instead of another (the foil), are widely regarded in explainable AI as more informative and interpretable than standard explanations. However, obtaining such explanations for speech-to-text (S2T) generative models remains an open challenge. Drawing from feature attribution techniques, we propose the first method to obtain contrastive explanations in S2T by analyzing how parts of the input spectrogram influence the choice between alternative outputs. Through a case study on gender assignment in speech translation, we show that our method accurately identifies the audio features that drive the selection of one gender over another. By extending the scope of contrastive explanations to S2T, our work provides a foundation for better understanding S2T models.
中文: 本文提出了首个通过分析输入频谱图影响来生成语音转文本模型对比解释的方法,并通过性别分配案例验证了其准确识别决定性别选择的音频特征的有效性。
English: This paper introduces the first method for generating contrastive explanations in speech-to-text models by analyzing input spectrogram influences, demonstrating its effectiveness in identifying audio features that determine gender assignment choices.

Authors:Gustavo Cilleruelo, Emily Allaway, Barry Haddow, Alexandra Birch
Title: MGen: Millions of Naturally Occurring Generics in Context
Abstract:
MGen is a dataset of over 4 million naturally occurring generic and quantified sentences extracted from diverse textual sources. Sentences in the dataset have long context documents, corresponding to websites and academic papers, and cover 11 different quantifiers. We analyze the features of generics sentences in the dataset, with interesting insights: generics can be long sentences (averaging over 16 words) and speakers often use them to express generalisations about people. MGen is the biggest and most diverse dataset of naturally occurring generic sentences, opening the door to large-scale computational research on genericity. It is publicly available at https://gustavocilleruelo.com/mgen
中文: MGen 是一个包含超过 400 万条来自不同来源的自然通用和量化句子的最大且最多样化的数据集,为泛型性的大规模计算研究开辟了道路,并公开可访问。
English: MGen is the largest and most diverse dataset of over 4 million naturally occurring generic and quantified sentences from various sources, enabling large-scale computational research on genericity and publicly accessible online.

Authors:Peilong Han, Fan Jia, Min Zhang, Yutao Qiu, Hongyao Tang, Yan Zheng, Tiancai Wang, Jianye Hao
Title: MUVLA: Learning to Explore Object Navigation via Map Understanding
Abstract:
In this paper, we present MUVLA, a Map Understanding Vision-Language-Action model tailored for object navigation. It leverages semantic map abstractions to unify and structure historical information, encoding spatial context in a compact and consistent form. MUVLA takes the current and history observations, as well as the semantic map, as inputs and predicts the action sequence based on the description of goal object. Furthermore, it amplifies supervision through reward-guided return modeling based on dense short-horizon progress signals, enabling the model to develop a detailed understanding of action value for reward maximization. MUVLA employs a three-stage training pipeline: learning map-level spatial understanding, imitating behaviors from mixed-quality demonstrations, and reward amplification. This strategy allows MUVLA to unify diverse demonstrations into a robust spatial representation and generate more rational exploration strategies. Experiments on HM3D and Gibson benchmarks demonstrate that MUVLA achieves great generalization and learns effective exploration behaviors even from low-quality or partially successful trajectories.
中文: 本文提出MUVLA模型,该模型通过语义地图整合历史信息并采用三阶段训练策略来预测导航动作,在基准测试中展现出优异的泛化能力和探索效率。
English: This paper introduces MUVLA, a vision-language-action model that utilizes semantic maps to structure historical data and predict navigation actions through a three-stage training process, demonstrating strong generalization and effective exploration on benchmark tests.

Authors:Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou
Title: Towards Personalized Deep Research: Benchmarks and Evaluations
Abstract:
Deep Research Agents (DRAs) can autonomously conduct complex investigations and generate comprehensive reports, demonstrating strong real-world potential. However, existing evaluations mostly rely on close-ended benchmarks, while open-ended deep research benchmarks remain scarce and typically neglect personalized scenarios. To bridge this gap, we introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in DRAs. It pairs 50 diverse research tasks across 10 domains with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. To assess system performance, we propose the PQR Evaluation Framework, which jointly measures (P) Personalization Alignment, (Q) Content Quality, and (R) Factual Reliability. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research. This work establishes a rigorous foundation for developing and evaluating the next generation of truly personalized AI research assistants.
中文: 本文提出了首个用于评估深度研究智能体个性化能力的基准测试——个性化深度研究平台,它包含10个领域的250个用户任务查询,并采用PQR评估框架综合衡量个性化匹配度、内容质量和事实可靠性,为开发下一代真正个性化的AI研究助手奠定了坚实基础。
English: This paper introduces the Personalized Deep Research Bench, the first benchmark designed to evaluate personalization in Deep Research Agents (DRAs), featuring 250 user-task queries across 10 domains and a comprehensive PQR framework to assess personalization alignment, content quality, and factual reliability, thereby establishing a foundation for advancing personalized AI research assistants.

Authors:Zhihao Li, Chaozheng Wang, Zongjie Li, Xinyong Peng, Zelin Su, Qun Xia, Haochuan Lu, Ting Xiong, Man Ho Lam, Shuzheng Gao, Yuchong Xie, Cuiyun Gao, Shuai Wang, Yuetang Deng, Huafeng Ma
Title: JSProtect: A Scalable Obfuscation Framework for Mini-Games in WeChat
Abstract:
The WeChat mini-game ecosystem faces rampant intellectual property theft to other platforms via secondary development, yet existing JavaScript obfuscation tools are ill-equipped for large-scale applications, suffering from prohibitive processing times, severe runtime performance degradation, and unsustainable code size inflation. This paper introduces JSProtect, a high-throughput parallelized obfuscation framework designed to overcome these fundamental limitations. At the core of our framework is the Parallel-Aware Scope Analysis (PASA) algorithm, which enables two key optimizations: independent code partitioning for multi-core processing and independent namespace management that aggressively reuses short identifiers to combat code bloat. Our evaluation demonstrates that JSProtectprocesses 20MB codebases in minutes, maintaining 100\% semantic equivalence while controlling code size inflation to as low as 20\% compared to over 1,000\% with baseline tools. Furthermore, it preserves near-native runtime performance and provides superior security effectiveness against both static analysis tools and large language models. This work presents a new paradigm for industrial-scale JavaScript protection that effectively balances robust security with high performance and scalability.
中文摘要:JSProtect是一种高通量并行混淆框架,能有效保护大规模JavaScript代码免受知识产权窃取,同时保持最低的性能开销和代码膨胀率。
English Summary: JSProtect is a high-throughput parallel obfuscation framework that efficiently secures large-scale JavaScript codebases against intellectual property theft while maintaining minimal performance overhead and code size inflation.

Authors:Minh Le, Bao-Ngoc Dao, Huy Nguyen, Quyen Tran, Anh Nguyen, Nhat Ho
Title: One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning
Abstract:
Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple "prompt experts" within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.
中文: SMoPE是一种新颖的持续学习框架,通过稀疏专家混合架构融合任务特定提示与共享提示策略,在显著降低计算成本和参数量的同时实现了顶尖性能。
English: SMoPE is a novel continual learning framework that combines task-specific and shared prompt strategies using a sparse mixture of experts architecture, achieving state-of-the-art performance while significantly reducing computational costs and parameters.

Authors:Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, Haifeng Wang
Title: Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality
Abstract:
Hallucination and factuality deficits remain key obstacles to the reliability of large language models (LLMs) in long-form generation. Existing reinforcement learning from human feedback (RLHF) frameworks primarily rely on preference rewards, yet they often overlook the model's internal knowledge boundaries, exacerbating the so-called "hallucination tax". To address this challenge, we propose Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), a novel framework that focuses on the knowledge consistency between the policy model's expressed knowledge and the base model's parametric knowledge, and introduces a Dual-Fact Alignment mechanism to jointly optimize factual recall and precision. Specifically, KLCF leverages pretrained knowledge boundaries to construct fact checklist, guiding online reinforcement learning to improve factual coverage and recall; simultaneously, it trains a self-assessment module based on the base model's internal knowledge to enhance factual precision during generation. Unlike prior methods that rely on external retrieval or heavy verification, our reward design is fully external-knowledge-free and lightweight, making KLCF efficient and easily scalable to large-scale training. Experimental results demonstrate that KLCF substantially improves factuality metrics across multiple long-form benchmarks and effectively alleviates model hallucinations.
中文摘要:提出的知识层级一致性强化学习框架(KLCF)通过双重事实对齐机制和自评估模块,将生成内容与模型内部知识进行校准,无需外部知识即可显著提升事实性指标并有效缓解幻觉问题。
English Summary: The proposed Knowledge-Level Consistency Reinforcement Learning Framework (KLCF) addresses LLM hallucinations by aligning generated content with the model's internal knowledge through dual-fact alignment and self-assessment, achieving significant factual improvements without external resources.

Authors:Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, Weijia Li
Title: Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents
Abstract:
Earth observation (EO) is essential for understanding the evolving states of the Earth system. Although recent MLLMs have advanced EO research, they still lack the capability to tackle complex tasks that require multi-step reasoning and the use of domain-specific tools. Agent-based methods offer a promising direction, but current attempts remain in their infancy, confined to RGB perception, shallow reasoning, and lacking systematic evaluation protocols. To overcome these limitations, we introduce Earth-Agent, the first agentic framework that unifies RGB and spectral EO data within an MCP-based tool ecosystem, enabling cross-modal, multi-step, and quantitative spatiotemporal reasoning beyond pretrained MLLMs. Earth-Agent supports complex scientific tasks such as geophysical parameter retrieval and quantitative spatiotemporal analysis by dynamically invoking expert tools and models across modalities. To support comprehensive evaluation, we further propose Earth-Bench, a benchmark of 248 expert-curated tasks with 13,729 images, spanning spectrum, products and RGB modalities, and equipped with a dual-level evaluation protocol that assesses both reasoning trajectories and final outcomes. We conduct comprehensive experiments varying different LLM backbones, comparisons with general agent frameworks, and comparisons with MLLMs on remote sensing benchmarks, demonstrating both the effectiveness and potential of Earth-Agent. Earth-Agent establishes a new paradigm for EO analysis, moving the field toward scientifically grounded, next-generation applications of LLMs in Earth observation. Our code and dataset will be publicly released.
Earth-Agent首次提出了一个智能体框架,统一RGB与光谱遥感数据,实现跨模态多步骤定量时空推理,并通过Earth-Bench基准测试推动地球观测分析超越现有MLLM的能力边界。
Earth-Agent introduces the first agentic framework that integrates RGB and spectral Earth observation data with multi-step reasoning capabilities, supported by the Earth-Bench benchmark for systematic evaluation, advancing EO analysis beyond current MLLM limitations.

Authors:Xiangqi Wang, Yue Huang, Yujun Zhou, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
Title: Causally-Enhanced Reinforcement Policy Optimization
Abstract:
Large language models (LLMs) trained with reinforcement objectives often achieve superficially correct answers via shortcut strategies, pairing correct outputs with spurious or unfaithful reasoning and degrading under small causal perturbations. We introduce Causally-Enhanced Policy Optimization (CE-PO), a drop-in reward-shaping framework that augments policy optimization with a differentiable proxy for causal coherence along the generation pathway from prompt (Z) to rationale (X) to answer (Y). CE-PO estimates model-internal influence with Jacobian-based sensitivities, counterfactually hardens these signals to suppress nuisance cues, and fuses the resulting coherence score with task-accuracy feedback via a Minkowski (power-mean) combiner, exposing a single tunable between accuracy and coherence trade-off. The unified reward integrates with PPO/GRPO without architectural changes. Across reasoning benchmarks and causal stress tests, CE-PO reduces reward hacking and unfaithful chain-of-thought while improving robustness to correlation-causation flips and light counterfactual edits, all at near-parity accuracy. Experimental results across 4 datasets show that CE-PO improves accuracy over baselines by 5.49% on average (up to 9.58%), while improving robustness to correlation-causation flips and light counterfactual edits.
中文: CE-PO是一种新颖的奖励塑造框架,通过将因果连贯性融入推理过程来增强策略优化,有效减少不忠实的推理并提高鲁棒性,同时保持竞争力的准确性。
English: CE-PO is a novel reward-shaping framework that enhances policy optimization by integrating causal coherence into the reasoning process, effectively reducing unfaithful reasoning and improving robustness while maintaining competitive accuracy.

Authors:Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
Title: High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling
Abstract:
We propose DAVIS, a Diffusion-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task through generative learning. Existing methods typically frame sound separation as a mask-based regression problem, achieving significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS circumvents these issues by leveraging potent generative modeling paradigms, specifically Denoising Diffusion Probabilistic Models (DDPM) and the more recent Flow Matching (FM), integrated within a specialized Separation U-Net architecture. Our framework operates by synthesizing the desired separated sound spectrograms directly from a noise distribution, conditioned concurrently on the mixed audio input and associated visual information. The inherent nature of its generative objective makes DAVIS particularly adept at producing high-quality sound separations for diverse sound categories. We present comparative evaluations of DAVIS, encompassing both its DDPM and Flow Matching variants, against leading methods on the standard AVE and MUSIC datasets. The results affirm that both variants surpass existing approaches in separation quality, highlighting the efficacy of our generative framework for tackling the audio-visual source separation task.
中文: DAVIS是一种基于扩散模型和流匹配的生成式视听分离框架,通过直接从噪声分布合成分离后的声音谱图,在多种声音类别上均实现了优于现有方法的分离质量。
English: DAVIS is a generative audio-visual separation framework that uses diffusion models and flow matching to directly synthesize separated sound spectrograms, outperforming existing methods in quality across diverse sound categories.

Authors:Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen
Title: Backdoor Attribution: Elucidating and Controlling Backdoor in Language Models
Abstract:
Fine-tuned Large Language Models (LLMs) are vulnerable to backdoor attacks through data poisoning, yet the internal mechanisms governing these attacks remain a black box. Previous research on interpretability for LLM safety tends to focus on alignment, jailbreak, and hallucination, but overlooks backdoor mechanisms, making it difficult to understand and fully eliminate the backdoor threat. In this paper, aiming to bridge this gap, we explore the interpretable mechanisms of LLM backdoors through Backdoor Attribution (BkdAttr), a tripartite causal analysis framework. We first introduce the Backdoor Probe that proves the existence of learnable backdoor features encoded within the representations. Building on this insight, we further develop Backdoor Attention Head Attribution (BAHA), efficiently pinpointing the specific attention heads responsible for processing these features. Our primary experiments reveals these heads are relatively sparse; ablating a minimal \textbf{$\sim$ 3%} of total heads is sufficient to reduce the Attack Success Rate (ASR) by \textbf{over 90%}. More importantly, we further employ these findings to construct the Backdoor Vector derived from these attributed heads as a master controller for the backdoor. Through only \textbf{1-point} intervention on \textbf{single} representation, the vector can either boost ASR up to \textbf{$\sim$ 100% ($\uparrow$)} on clean inputs, or completely neutralize backdoor, suppressing ASR down to \textbf{$\sim$ 0% ($\downarrow$)} on triggered inputs. In conclusion, our work pioneers the exploration of mechanistic interpretability in LLM backdoors, demonstrating a powerful method for backdoor control and revealing actionable insights for the community.
Chinese: 本研究提出Backdoor Attribution (BkdAttr)框架,通过识别微调大语言模型中负责后门特征的稀疏注意力头,仅需对单个表征进行单点干预即可实现后门攻击的精准增强或完全消除。
English: This study introduces Backdoor Attribution (BkdAttr), a framework that identifies sparse attention heads responsible for backdoor features in fine-tuned LLMs, enabling precise control to either amplify or neutralize attacks with minimal intervention.

Authors:Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma
Title: Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
Abstract:
Long context training is crucial for LLM's context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on partitioning granularity. Batch-level PP dividing input samples exhibits high memory consumption in long-context scenario, whereas token-level PP splitting sequences into slices alleviates memory overhead but may incur hardware under-utilization. This trade-off motivates adaptively selecting PP granularity to match resource and workload characteristics. Moreover, sequence length distribution of the real-world dataset exhibits skewness, posing a challenge on PP's workload balance and efficient scheduling. Current static PP scheduling methods overlook the variance of sequence length, leading to suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism (EPP) that orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. We build InfiniPipe, a distributed training system that unleashes the potential of EPP via (1) a resource-aware and workload-balanced sequence processor that splits long sequences and packs short ones; and (2) a co-optimization methodology that jointly optimizes pipeline schedule and gradient checkpointing via a mechanism named stage-aware chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
中文: 本文提出弹性流水线并行(EPP)及其实现系统InfiniPipe,通过动态调整并行粒度和协同优化调度策略,有效应对长上下文训练中序列长度不均的挑战,相比现有系统实现了1.69倍的加速。
English: This paper introduces Elastic Pipeline Parallelism (EPP) and its implementation, InfiniPipe, which dynamically adapts pipeline parallelism granularity and optimizes scheduling to efficiently handle long-context training with skewed sequence lengths, achieving a 1.69x speedup over existing systems.

Authors:Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu
Title: Differential-Integral Neural Operator for Long-Term Turbulence Forecasting
Abstract:
Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the {\textbf{\underline{D}}}ifferential-{\textbf{\underline{I}}}ntegral {\textbf{\underline{N}}}eural {\textbf{\underline{O}}}perator (\method{}), a novel framework designed from a first-principles approach of operator decomposition. \method{} explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows \method{} with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that \method{} significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecast.
中文: 提出的微分-积分神经算子(DINO)通过将湍流动态分解为局部微分和全局积分算子,解决了长期湍流预测的难题,在二维柯尔莫哥洛夫流实验中展现出卓越的稳定性和精度。
English: The proposed Differential-Integral Neural Operator (DINO) addresses long-term turbulence forecasting by decomposing turbulent dynamics into local differential and global integral operators, achieving superior stability and accuracy in 2D Kolmogorov flow experiments.

Authors:Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Qingsong Wen, Kun Wang, Xiaomeng Huang, Xian Wu
Title: Differential-Integral Neural Operator for Long-Term Turbulence Forecasting
Abstract:
Accurately forecasting the long-term evolution of turbulence represents a grand challenge in scientific computing and is crucial for applications ranging from climate modeling to aerospace engineering. Existing deep learning methods, particularly neural operators, often fail in long-term autoregressive predictions, suffering from catastrophic error accumulation and a loss of physical fidelity. This failure stems from their inability to simultaneously capture the distinct mathematical structures that govern turbulent dynamics: local, dissipative effects and global, non-local interactions. In this paper, we propose the {\textbf{\underline{D}}}ifferential-{\textbf{\underline{I}}}ntegral {\textbf{\underline{N}}}eural {\textbf{\underline{O}}}perator (\method{}), a novel framework designed from a first-principles approach of operator decomposition. \method{} explicitly models the turbulent evolution through parallel branches that learn distinct physical operators: a local differential operator, realized by a constrained convolutional network that provably converges to a derivative, and a global integral operator, captured by a Transformer architecture that learns a data-driven global kernel. This physics-based decomposition endows \method{} with exceptional stability and robustness. Through extensive experiments on the challenging 2D Kolmogorov flow benchmark, we demonstrate that \method{} significantly outperforms state-of-the-art models in long-term forecasting. It successfully suppresses error accumulation over hundreds of timesteps, maintains high fidelity in both the vorticity fields and energy spectra, and establishes a new benchmark for physically consistent, long-range turbulence forecast.
中文: 提出的微分-积分神经算子(DINO)通过将湍流动态分解为局部微分和全局积分算子,解决了长期湍流预测的难题,在二维柯尔莫哥洛夫流实验中展现出卓越的稳定性和精度。
English: The proposed Differential-Integral Neural Operator (DINO) addresses long-term turbulence forecasting by decomposing turbulent dynamics into local differential and global integral operators, achieving superior stability and accuracy in 2D Kolmogorov flow experiments.

Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Title: Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization
Abstract:
Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
中文摘要:SummQ是一种新颖的对抗性多智能体框架,通过总结与问答双领域的智能体协作,有效解决了长文档摘要中的信息丢失和一致性问题,在多项评估中显著优于现有最优方法。
English Summary: SummQ is an adversarial multi-agent framework that enhances long document summarization by employing collaborative agents for summarization and quizzing, significantly outperforming existing methods across multiple benchmarks and evaluation metrics.

Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Title: Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization
Abstract:
Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
中文摘要:SummQ是一种新颖的对抗性多智能体框架,通过总结与问答双领域的智能体协作,有效解决了长文档摘要中的信息丢失和一致性问题,在多项评估中显著优于现有最优方法。
English Summary: SummQ is an adversarial multi-agent framework that enhances long document summarization by employing collaborative agents for summarization and quizzing, significantly outperforming existing methods across multiple benchmarks and evaluation metrics.

Authors:Tamer Ahmed Eltaras, Qutaibah Malluhi, Alessandro Savino, Stefano Di Carlo, Adnan Qayyum
Title: R-CONV++: Uncovering Privacy Vulnerabilities through Analytical Gradient Inversion Attacks
Abstract:
Federated learning has emerged as a prominent privacy-preserving technique for leveraging large-scale distributed datasets by sharing gradients instead of raw data. However, recent studies indicate that private training data can still be exposed through gradient inversion attacks. While earlier analytical methods have demonstrated success in reconstructing input data from fully connected layers, their effectiveness significantly diminishes when applied to convolutional layers, high-dimensional inputs, and scenarios involving multiple training examples. This paper extends our previous work \cite{eltaras2024r} and proposes three advanced algorithms to broaden the applicability of gradient inversion attacks. The first algorithm presents a novel data leakage method that efficiently exploits convolutional layer gradients, demonstrating that even with non-fully invertible activation functions, such as ReLU, training samples can be analytically reconstructed directly from gradients without the need to reconstruct intermediate layer outputs. Building on this foundation, the second algorithm extends this analytical approach to support high-dimensional input data, substantially enhancing its utility across complex real-world datasets. The third algorithm introduces an innovative analytical method for reconstructing mini-batches, addressing a critical gap in current research that predominantly focuses on reconstructing only a single training example. Unlike previous studies that focused mainly on the weight constraints of convolutional layers, our approach emphasizes the pivotal role of gradient constraints, revealing that successful attacks can be executed with fewer than 5\% of the constraints previously deemed necessary in certain layers.
Chinese: 本文提出了三种先进算法,通过有效利用卷积层梯度显著增强了梯度反转攻击能力,实现了高维输入和小批量的数据重建,所需约束条件远少于以往研究。
English: This paper introduces three advanced algorithms that significantly enhance gradient inversion attacks by efficiently exploiting convolutional layer gradients, enabling the reconstruction of high-dimensional inputs and mini-batches with far fewer constraints than previously required.

Authors:Tamer Ahmed Eltaras, Qutaibah Malluhi, Alessandro Savino, Stefano Di Carlo, Adnan Qayyum
Title: Uncovering Privacy Vulnerabilities through Analytical Gradient Inversion Attacks
Abstract:
Federated learning has emerged as a prominent privacy-preserving technique for leveraging large-scale distributed datasets by sharing gradients instead of raw data. However, recent studies indicate that private training data can still be exposed through gradient inversion attacks. While earlier analytical methods have demonstrated success in reconstructing input data from fully connected layers, their effectiveness significantly diminishes when applied to convolutional layers, high-dimensional inputs, and scenarios involving multiple training examples. This paper extends our previous work \cite{eltaras2024r} and proposes three advanced algorithms to broaden the applicability of gradient inversion attacks. The first algorithm presents a novel data leakage method that efficiently exploits convolutional layer gradients, demonstrating that even with non-fully invertible activation functions, such as ReLU, training samples can be analytically reconstructed directly from gradients without the need to reconstruct intermediate layer outputs. Building on this foundation, the second algorithm extends this analytical approach to support high-dimensional input data, substantially enhancing its utility across complex real-world datasets. The third algorithm introduces an innovative analytical method for reconstructing mini-batches, addressing a critical gap in current research that predominantly focuses on reconstructing only a single training example. Unlike previous studies that focused mainly on the weight constraints of convolutional layers, our approach emphasizes the pivotal role of gradient constraints, revealing that successful attacks can be executed with fewer than 5\% of the constraints previously deemed necessary in certain layers.
Chinese: 本文提出了三种先进算法,通过有效利用卷积层梯度显著增强了梯度反转攻击能力,实现了高维输入和小批量的数据重建,所需约束条件远少于以往研究。
English: This paper introduces three advanced algorithms that significantly enhance gradient inversion attacks by efficiently exploiting convolutional layer gradients, enabling the reconstruction of high-dimensional inputs and mini-batches with far fewer constraints than previously required.

Authors:Zhiyu Kan, Wensheng Gan, Zhenlian Qi, Philip S. Yu
Title: Advances in Large Language Models for Medicine
Abstract:
Artificial intelligence (AI) technology has advanced rapidly in recent years, with large language models (LLMs) emerging as a significant breakthrough. LLMs are increasingly making an impact across various industries, with the medical field standing out as the most prominent application area. This paper systematically reviews the up-to-date research progress of LLMs in the medical field, providing an in-depth analysis of training techniques for large medical models, their adaptation in healthcare settings, related applications, as well as their strengths and limitations. Furthermore, it innovatively categorizes medical LLMs into three distinct types based on their training methodologies and classifies their evaluation approaches into two categories. Finally, the study proposes solutions to existing challenges and outlines future research directions based on identified issues in the field of medical LLMs. By systematically reviewing previous and advanced research findings, we aim to highlight the necessity of developing medical LLMs, provide a deeper understanding of their current state of development, and offer clear guidance for subsequent research.
中文: 本文系统综述了大语言模型在医学领域的最新进展,分析了其训练方法、应用场景与局限性,并提出创新性分类及未来研究方向,为医疗大模型的深入发展提供清晰指引。
English: This paper systematically reviews the latest progress of large language models (LLMs) in medicine, analyzing their training, applications, and limitations while proposing innovative classifications and future research directions to guide development in the field.

Authors:Xinyu Mu, Hui Dou, Furao Shen, Jian Zhao
Title: ConceptFlow: Hierarchical and Fine-grained Concept-Based Explanation for Convolutional Neural Networks
Abstract:
Concept-based interpretability for Convolutional Neural Networks (CNNs) aims to align internal model representations with high-level semantic concepts, but existing approaches largely overlook the semantic roles of individual filters and the dynamic propagation of concepts across layers. To address these limitations, we propose ConceptFlow, a concept-based interpretability framework that simulates the internal "thinking path" of a model by tracing how concepts emerge and evolve across layers. ConceptFlow comprises two key components: (i) concept attentions, which associate each filter with relevant high-level concepts to enable localized semantic interpretation, and (ii) conceptual pathways, derived from a concept transition matrix that quantifies how concepts propagate and transform between filters. Together, these components offer a unified and structured view of internal model reasoning. Experimental results demonstrate that ConceptFlow yields semantically meaningful insights into model reasoning, validating the effectiveness of concept attentions and conceptual pathways in explaining decision behavior. By modeling hierarchical conceptual pathways, ConceptFlow provides deeper insight into the internal logic of CNNs and supports the generation of more faithful and human-aligned explanations.
Chinese: ConceptFlow提出了一种基于概念的可解释性框架,通过概念注意力和概念通路追踪概念在CNN各层的演化,为模型推理提供结构化洞察并生成与人类认知一致的解释。
English: ConceptFlow introduces a concept-based interpretability framework that traces concept evolution across CNN layers through concept attentions and conceptual pathways, offering structured insights into model reasoning and generating human-aligned explanations.

Authors:Sara Papi, Dennis Fucci, Marco Gaido, Matteo Negri, Luisa Bentivogli
Title: Cross-Attention is Half Explanation in Speech-to-Text Models
Abstract:
Cross-attention is a core mechanism in encoder-decoder architectures, widespread in many fields, including speech-to-text (S2T) processing. Its scores have been repurposed for various downstream applications--such as timestamp estimation and audio-text alignment--under the assumption that they reflect the dependencies between input speech representation and the generated text. While the explanatory nature of attention mechanisms has been widely debated in the broader NLP literature, this assumption remains largely unexplored within the speech domain. To address this gap, we assess the explanatory power of cross-attention in S2T models by comparing its scores to input saliency maps derived from feature attribution. Our analysis spans monolingual and multilingual, single-task and multi-task models at multiple scales, and shows that attention scores moderately to strongly align with saliency-based explanations, particularly when aggregated across heads and layers. However, it also shows that cross-attention captures only about 50% of the input relevance and, in the best case, only partially reflects how the decoder attends to the encoder's representations--accounting for just 52-75% of the saliency. These findings uncover fundamental limitations in interpreting cross-attention as an explanatory proxy, suggesting that it offers an informative yet incomplete view of the factors driving predictions in S2T models.
中文摘要:语音转文本模型中的交叉注意力机制虽与基于显著性的解释存在中等到强相关性,但仅能捕捉约一半的输入相关性,这揭示了其作为模型预测完整解释性指标的局限性。
English Summary: Cross-attention in speech-to-text models shows moderate to strong alignment with saliency-based explanations but captures only about half of the input relevance, revealing its limitations as a complete explanatory proxy for model predictions.

Authors:Yue Huang, Zhengzhe Jiang, Xiaonan Luo, Kehan Guo, Haomin Zhuang, Yujun Zhou, Zhengqing Yuan, Xiaoqi Sun, Jules Schleinitz, Yanbo Wang, Shuhao Zhang, Mihir Surve, Nitesh V Chawla, Olaf Wiest, Xiangliang Zhang
Title: ChemOrch: Empowering LLMs with Chemical Intelligence via Synthetic Instructions
Abstract:
Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning and distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.
中文摘要:ChemOrch框架通过任务控制指令生成和工具感知响应构建的两阶段过程,生成化学基础扎实的指令-响应对,以可控的多样性和难度提升大型语言模型的化学智能,显著增强了其在化学领域的表现。
English Summary: ChemOrch is a framework designed to enhance large language models' chemical intelligence by generating chemically accurate instruction-response pairs through a two-stage process, ensuring task diversity and response precision, which significantly improves LLM performance in chemistry.

Authors:Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar, Panchal Nayak, Priyabrata Mallick, Swarup Ranjan Behera, Parabattina Bhagath, Pailla Balakrishna Reddy, Arun Balaji Buduru
Title: Investigating Polyglot Speech Foundation Models for Learning Collective Emotion from Crowds
Abstract:
This paper investigates the polyglot (multilingual) speech foundation models (SFMs) for Crowd Emotion Recognition (CER). We hypothesize that polyglot SFMs, pre-trained on diverse languages, accents, and speech patterns, are particularly adept at navigating the noisy and complex acoustic environments characteristic of crowd settings, thereby offering a significant advantage for CER. To substantiate this, we perform a comprehensive analysis, comparing polyglot, monolingual, and speaker recognition SFMs through extensive experiments on a benchmark CER dataset across varying audio durations (1 sec, 500 ms, and 250 ms). The results consistently demonstrate the superiority of polyglot SFMs, outperforming their counterparts across all audio lengths and excelling even with extremely short-duration inputs. These findings pave the way for adaptation of SFMs in setting up new benchmarks for CER.
中文: 本研究证明,基于多语言预训练的语音基础模型在人群情绪识别任务中,无论音频时长长短均显著优于单语言及说话人识别模型,为建立新基准铺平了道路。
English: This study demonstrates that polyglot speech foundation models, pre-trained on diverse linguistic data, significantly outperform monolingual and speaker recognition models in crowd emotion recognition across various audio durations, including extremely short inputs.

Authors:Qingfeng Zhou, Wensheng Gan, Zhenlian Qi, Philip S. Yu
Title: Utility-based Privacy Preserving Data Mining
Abstract:
With the advent of big data, periodic pattern mining has demonstrated significant value in real-world applications, including smart home systems, healthcare systems, and the medical field. However, advances in network technology have enabled malicious actors to extract sensitive information from publicly available datasets, posing significant threats to data providers and, in severe cases, hindering societal development. To mitigate such risks, privacy-preserving utility mining (PPUM) has been proposed. However, PPUM is unsuitable for addressing privacy concerns in periodic information mining. To address this issue, we innovatively extend the existing PPUM framework and propose two algorithms, Maximum sensitive Utility-MAximum maxPer item (MU-MAP) and Maximum sensitive Utility-MInimum maxPer item (MU-MIP). These algorithms aim to hide sensitive periodic high-utility itemsets while generating sanitized datasets. To enhance the efficiency of the algorithms, we designed two novel data structures: the Sensitive Itemset List (SISL) and the Sensitive Item List (SIL), which store essential information about sensitive itemsets and their constituent items. Moreover, several performance metrics were employed to evaluate the performance of our algorithms compared to the state-of-the-art PPUM algorithms. The experimental results show that our proposed algorithms achieve an Artificial Cost (AC) value of 0 on all datasets when hiding sensitive itemsets. In contrast, the traditional PPUM algorithm yields non-zero AC. This indicates that our algorithms can successfully hide sensitive periodic itemsets without introducing misleading patterns, whereas the PPUM algorithm generates additional itemsets that may interfere with user decision-making. Moreover, the results also reveal that our algorithms maintain Database Utility Similarity (DUS) of over 90\% after the sensitive itemsets are hidden.
中文摘要:本研究提出了MU-MAP和MU-MIP两种创新算法,在隐藏敏感周期性高效用项集的同时保持超过90%的数据库效用相似度,其性能优于会产生误导性模式的传统隐私保护效用挖掘方法。
English Summary: This study introduces two novel algorithms, MU-MAP and MU-MIP, which effectively hide sensitive periodic high-utility itemsets in datasets while maintaining over 90% database utility similarity, outperforming traditional privacy-preserving utility mining methods that generate misleading patterns.

Authors:Fei Ni, Min Zhang, Pengyi Li, Yifu Yuan, Lingfeng Zhang, Yuecheng Liu, Peilong Han, Longxin Kou, Shaojin Ma, Jinbin Qiao, David Gamaliel Arcos Bravo, Yuening Wang, Xiao Hu, Zhanguang Zhang, Xianze Yao, Yutong Li, Zhao Zhang, Ying Wen, Ying-Cong Chen, Xiaodan Liang, Liang Lin, Bin He, Haitham Bou-Ammar, He Wang, Huazhe Xu, Jiankang Deng, Shan Luo, Shuqiang Jiang, Wei Pan, Yang Gao, Stefanos Zafeiriou, Jan Peters, Yuzheng Zhuang, Yingxue Zhang, Yan Zheng, Hongyao Tang, Jianye Hao
Title: Embodied Arena: A Comprehensive, Unified, and Evolving Evaluation Platform for Embodied AI
Abstract:
Embodied AI development significantly lags behind large foundation models due to three critical challenges: (1) lack of systematic understanding of core capabilities needed for Embodied AI, making research lack clear objectives; (2) absence of unified and standardized evaluation systems, rendering cross-benchmark evaluation infeasible; and (3) underdeveloped automated and scalable acquisition methods for embodied data, creating critical bottlenecks for model scaling. To address these obstacles, we present Embodied Arena, a comprehensive, unified, and evolving evaluation platform for Embodied AI. Our platform establishes a systematic embodied capability taxonomy spanning three levels (perception, reasoning, task execution), seven core capabilities, and 25 fine-grained dimensions, enabling unified evaluation with systematic research objectives. We introduce a standardized evaluation system built upon unified infrastructure supporting flexible integration of 22 diverse benchmarks across three domains (2D/3D Embodied Q&A, Navigation, Task Planning) and 30+ advanced models from 20+ worldwide institutes. Additionally, we develop a novel LLM-driven automated generation pipeline ensuring scalable embodied evaluation data with continuous evolution for diversity and comprehensiveness. Embodied Arena publishes three real-time leaderboards (Embodied Q&A, Navigation, Task Planning) with dual perspectives (benchmark view and capability view), providing comprehensive overviews of advanced model capabilities. Especially, we present nine findings summarized from the evaluation results on the leaderboards of Embodied Arena. This helps to establish clear research veins and pinpoint critical research problems, thereby driving forward progress in the field of Embodied AI.
中文: 具身智能面临三大挑战——核心能力不明确、评估体系不统一、数据扩展性不足,而Embodied Arena平台通过构建系统化能力分类、标准化评估基准和自动化数据生成,为领域发展提供了统一解决方案。
English: Embodied AI faces three major challenges—unclear core capabilities, lack of standardized evaluation, and limited data scalability—which are addressed by Embodied Arena, a unified platform offering systematic capability taxonomy, standardized benchmarks, and automated data generation to advance the field.

Authors:Brian Yan, Injy Hamed, Shuichiro Shimizu, Vasista Lodagala, William Chen, Olga Iakovenko, Bashar Talafha, Amir Hussein, Alexander Polok, Kalvin Chang, Dominik Klement, Sara Althubaiti, Puyuan Peng, Matthew Wiesner, Thamar Solorio, Ahmed Ali, Sanjeev Khudanpur, Shinji Watanabe, Chih-Chen Chen, Zhen Wu, Karim Benharrak, Anuj Diwan, Samuele Cornell, Eunjung Yeo, Kwanghee Choi, Carlos Carvalho, Karen Rosero
Title: CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset
Abstract:
We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech 3) a 60 {Arabic, Mandarin, Hindi, Spanish}-X language pair set with the generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: https://huggingface.co/datasets/byan/cs-fleurs.
中文:CS-FLEURS是一个新型数据集,旨在推动语码转换语音识别与翻译研究,包含涵盖52种语言共113对语言组合的四个测试集和128小时训练集,以拓宽研究范围。
English: CS-FLEURS is a novel dataset designed to advance code-switched speech recognition and translation, featuring four test sets covering 113 language pairs across 52 languages and a 128-hour training set to expand research accessibility.

Authors:Nhi Kieu, Kien Nguyen, Arnold Wiliem, Clinton Fookes, Sridha Sridharan
Title: Filling the Gaps: A Multitask Hybrid Multiscale Generative Framework for Missing Modality in Remote Sensing Semantic Segmentation
Abstract:
Multimodal learning has shown significant performance boost compared to ordinary unimodal models across various domains. However, in real-world scenarios, multimodal signals are susceptible to missing because of sensor failures and adverse weather conditions, which drastically deteriorates models' operation and performance. Generative models such as AutoEncoder (AE) and Generative Adversarial Network (GAN) are intuitive solutions aiming to reconstruct missing modality from available ones. Yet, their efficacy in remote sensing semantic segmentation remains underexplored. In this paper, we first examine the limitations of existing generative approaches in handling the heterogeneity of multimodal remote sensing data. They inadequately capture semantic context in complex scenes with large intra-class and small inter-class variation. In addition, traditional generative models are susceptible to heavy dependence on the dominant modality, introducing bias that affects model robustness under missing modality conditions. To tackle these limitations, we propose a novel Generative-Enhanced MultiModal learning Network (GEMMNet) with three key components: (1) Hybrid Feature Extractor (HyFEx) to effectively learn modality-specific representations, (2) Hybrid Fusion with Multiscale Awareness (HyFMA) to capture modality-synergistic semantic context across scales and (3) Complementary Loss (CoLoss) scheme to alleviate the inherent bias by encouraging consistency across modalities and tasks. Our method, GEMMNet, outperforms both generative baselines AE, cGAN (conditional GAN), and state-of-the-art non-generative approaches - mmformer and shaspec - on two challenging semantic segmentation remote sensing datasets (Vaihingen and Potsdam). Source code is made available.
中文摘要:针对多模态学习中数据缺失及现有生成模型不足的问题,本文提出GEMMNet模型,通过混合特征提取与多尺度融合等创新设计,在遥感语义分割任务中超越了现有生成式与非生成式方法。
English Summary: Multimodal learning faces challenges from missing data and existing generative models' limitations, prompting the proposed GEMMNet with specialized components that outperforms current methods in remote sensing segmentation.

Authors:Jiayi Ye, Chaoran Chen, Yue Huang, Yanfang Ye, Toby Jia-Jun Li, Xiangliang Zhang
Title: My Favorite Streamer is an LLM: Discovering, Bonding, and Co-Creating in AI VTuber Fandom
Abstract:
AI VTubers, where the performer is not human but algorithmically generated, introduce a new context for fandom. While human VTubers have been substantially studied for their cultural appeal, parasocial dynamics, and community economies, little is known about how audiences engage with their AI counterparts. To address this gap, we present a qualitative study of Neuro-sama, the most prominent AI VTuber. Our findings show that engagement is anchored in active co-creation: audiences are drawn by the AI's unpredictable yet entertaining interactions, cement loyalty through collective emotional events that trigger anthropomorphic projection, and sustain attachment via the AI's consistent persona. Financial support emerges not as a reward for performance but as a participatory mechanism for shaping livestream content, establishing a resilient fan economy built on ongoing interaction. These dynamics reveal how AI Vtuber fandom reshapes fan-creator relationships and offer implications for designing transparent and sustainable AI-mediated communities.
中文: AI虚拟主播通过不可预测的互动、情感拟人化投射和参与式经济支持重塑粉丝关系,其粉丝经济建立在持续互动基础上,为构建可持续的AI社群提供新范式。
English: AI VTubers foster audience engagement through unpredictable interactions, emotional anthropomorphic projection, and participatory financial support, reshaping fan-creator relationships and community economies.

Authors:Jiahao Wang, Yufeng Yuan, Rujie Zheng, Youtian Lin, Jian Gao, Lin-Zhuo Chen, Yajie Bao, Yi Zhang, Chang Zeng, Yanxi Zhou, Xiaoxiao Long, Hao Zhu, Zhaoxiang Zhang, Xun Cao, Yao Yao
Title: SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Abstract:
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect \textbf{SpatialVID}, a dataset consists of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process them into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID's data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.
Chinese: 空间智能虽取得重大进展,但受限于训练数据不足;为此开发了SpatialVID数据集,包含大量多样化视频和丰富3D标注,有效提升模型泛化能力与性能。
English: Significant advances in spatial intelligence are hindered by limited training data, leading to the creation of SpatialVID, a large-scale dataset with diverse videos and rich 3D annotations to enhance model generalization and performance.

Authors:Gianluca Amprimo, Alberto Ancilotto, Alessandro Savino, Fabio Quazzolo, Claudia Ferraris, Gabriella Olmo, Elisabetta Farella, Stefano Di Carlo
Title: EHWGesture -- A dataset for multimodal understanding of clinical gestures
Abstract:
Hand gesture understanding is essential for several applications in human-computer interaction, including automatic clinical assessment of hand dexterity. While deep learning has advanced static gesture recognition, dynamic gesture understanding remains challenging due to complex spatiotemporal variations. Moreover, existing datasets often lack multimodal and multi-view diversity, precise ground-truth tracking, and an action quality component embedded within gestures. This paper introduces EHWGesture, a multimodal video dataset for gesture understanding featuring five clinically relevant gestures. It includes over 1,100 recordings (6 hours), captured from 25 healthy subjects using two high-resolution RGB-Depth cameras and an event camera. A motion capture system provides precise ground-truth hand landmark tracking, and all devices are spatially calibrated and synchronized to ensure cross-modal alignment. Moreover, to embed an action quality task within gesture understanding, collected recordings are organized in classes of execution speed that mirror clinical evaluations of hand dexterity. Baseline experiments highlight the dataset's potential for gesture classification, gesture trigger detection, and action quality assessment. Thus, EHWGesture can serve as a comprehensive benchmark for advancing multimodal clinical gesture understanding.
中文摘要:EHWGesture是一个包含临床相关手势的多模态视频数据集,具备精确手部追踪和同步多视角记录,旨在推动人机交互中动态手势理解与动作质量评估的研究发展。
English Summary: EHWGesture is a multimodal video dataset featuring clinically relevant gestures with precise hand tracking and synchronized multi-view recordings, designed to advance dynamic gesture understanding and action quality assessment in human-computer interaction.

Authors:Daniel Scholz, Ayhan Can Erdur, Viktoria Ehm, Anke Meyer-Baese, Jan C. Peeken, Daniel Rueckert, Benedikt Wiestler
Title: MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis
Abstract:
Vision foundation models like DINOv2 demonstrate remarkable potential in medical imaging despite their origin in natural image domains. However, their design inherently works best for uni-modal image analysis, limiting their effectiveness for multi-modal imaging tasks that are common in many medical fields, such as neurology and oncology. While supervised models perform well in this setting, they fail to leverage unlabeled datasets and struggle with missing modalities, a frequent challenge in clinical settings. To bridge these gaps, we introduce MM-DINOv2, a novel and efficient framework that adapts the pre-trained vision foundation model DINOv2 for multi-modal medical imaging. Our approach incorporates multi-modal patch embeddings, enabling vision foundation models to effectively process multi-modal imaging data. To address missing modalities, we employ full-modality masking, which encourages the model to learn robust cross-modality relationships. Furthermore, we leverage semi-supervised learning to harness large unlabeled datasets, enhancing both the accuracy and reliability of medical predictions. Applied to glioma subtype classification from multi-sequence brain MRI, our method achieves a Matthews Correlation Coefficient (MCC) of 0.6 on an external test set, surpassing state-of-the-art supervised approaches by +11.1%. Our work establishes a scalable and robust solution for multi-modal medical imaging tasks, leveraging powerful vision foundation models pre-trained on natural images while addressing real-world clinical challenges such as missing data and limited annotations.
Chinese: MM-DINOv2框架通过引入多模态补丁嵌入和全模态掩码技术,成功将DINOv2视觉基础模型适配于多模态医学影像分析,在胶质瘤亚型分类任务中以11.1%的优势超越现有最优方法。
English: The MM-DINOv2 framework effectively adapts the DINOv2 vision foundation model for multi-modal medical imaging by incorporating multi-modal patch embeddings and full-modality masking, achieving superior performance in glioma subtype classification with an 11.1% improvement over state-of-the-art methods.

Authors:Daniel Scholz, Ayhan Can Erdur, Robbie Holland, Viktoria Ehm, Jan C. Peeken, Benedikt Wiestler, Daniel Rueckert
Title: Contrastive Anatomy-Contrast Disentanglement: A Domain-General MRI Harmonization Method
Abstract:
Magnetic resonance imaging (MRI) is an invaluable tool for clinical and research applications. Yet, variations in scanners and acquisition parameters cause inconsistencies in image contrast, hindering data comparability and reproducibility across datasets and clinical studies. Existing scanner harmonization methods, designed to address this challenge, face limitations, such as requiring traveling subjects or struggling to generalize to unseen domains. We propose a novel approach using a conditioned diffusion autoencoder with a contrastive loss and domain-agnostic contrast augmentation to harmonize MR images across scanners while preserving subject-specific anatomy. Our method enables brain MRI synthesis from a single reference image. It outperforms baseline techniques, achieving a +7% PSNR improvement on a traveling subjects dataset and +18% improvement on age regression in unseen. Our model provides robust, effective harmonization of brain MRIs to target scanners without requiring fine-tuning. This advancement promises to enhance comparability, reproducibility, and generalizability in multi-site and longitudinal clinical studies, ultimately contributing to improved healthcare outcomes.
中文摘要:该研究提出的扩散自编码器方法通过单张参考图像有效统一不同扫描仪的脑部MRI对比度,无需微调即可显著提升图像质量与泛化能力。
English Summary: The proposed diffusion autoencoder method effectively harmonizes brain MRI contrasts across different scanners using a single reference image, significantly improving image quality and generalizability without fine-tuning.

Authors:Yuchong Xie, Mingyu Luo, Zesen Liu, Zhixiang Zhang, Kaikai Zhang, Yu Liu, Zongjie Li, Ping Chen, Shuai Wang, Dongdong She
Title: On the Security of Tool-Invocation Prompts for LLM-Based Agentic Systems: An Empirical Risk Assessment
Abstract:
LLM-based agentic systems leverage large language models to handle user queries, make decisions, and execute external tools for complex tasks across domains like chatbots, customer service, and software engineering. A critical component of these systems is the Tool Invocation Prompt (TIP), which defines tool interaction protocols and guides LLMs to ensure the security and correctness of tool usage. Despite its importance, TIP security has been largely overlooked. This work investigates TIP-related security risks, revealing that major LLM-based systems like Cursor, Claude Code, and others are vulnerable to attacks such as remote code execution (RCE) and denial of service (DoS). Through a systematic TIP exploitation workflow (TEW), we demonstrate external tool behavior hijacking via manipulated tool invocations. We also propose defense mechanisms to enhance TIP security in LLM-based agentic systems.
中文: 本研究揭示了基于大语言模型的智能体系统中工具调用提示(TIP)的安全漏洞,通过系统化攻击流程演示了远程代码执行等攻击如何劫持工具行为,同时提出了增强TIP安全性的防御机制。
English: This study exposes critical security vulnerabilities in LLM-based agentic systems' Tool Invocation Prompts (TIPs), demonstrating how attacks like remote code execution can hijack tool behaviors through systematic exploitation, while also proposing defense mechanisms to strengthen TIP security.

Authors:Qi Chen, Jingxuan Wei, Zhuoya Yao, Haiguang Wang, Gaowei Wu, Bihui Yu, Siyuan Li, Cheng Tan
Title: ResearchPulse: Building Method-Experiment Chains through Multi-Document Scientific Inference
Abstract:
Understanding how scientific ideas evolve requires more than summarizing individual papers-it demands structured, cross-document reasoning over thematically related research. In this work, we formalize multi-document scientific inference, a new task that extracts and aligns motivation, methodology, and experimental results across related papers to reconstruct research development chains. This task introduces key challenges, including temporally aligning loosely structured methods and standardizing heterogeneous experimental tables. We present ResearchPulse, an agent-based framework that integrates instruction planning, scientific content extraction, and structured visualization. It consists of three coordinated agents: a Plan Agent for task decomposition, a Mmap-Agent that constructs motivation-method mind maps, and a Lchart-Agent that synthesizes experimental line charts. To support this task, we introduce ResearchPulse-Bench, a citation-aware benchmark of annotated paper clusters. Experiments show that our system, despite using 7B-scale agents, consistently outperforms strong baselines like GPT-4o in semantic alignment, structural consistency, and visual fidelity. The dataset are available in https://huggingface.co/datasets/ResearchPulse/ResearchPulse-Bench.
中文总结:本文提出ResearchPulse框架,通过协调多个智能体对相关论文进行动机、方法和结果的跨文档对齐,重构研究发展链条,实验表明其在使用较小模型的情况下仍优于GPT-4o等基线。
English Summary: This paper introduces ResearchPulse, a multi-document scientific inference framework using coordinated agents to reconstruct research development chains by aligning motivations, methods, and results across related papers, demonstrating superior performance over GPT-4o despite using smaller models.

Authors:Weiyuan Li, Xintao Wang, Siyu Yuan, Rui Xu, Jiangjie Chen, Qingqing Dong, Yanghua Xiao, Deqing Yang
Title: Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges
Abstract:
As large language models (LLMs) grow more capable, they face increasingly diverse and complex tasks, making reliable evaluation challenging. The paradigm of LLMs as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings. Their reliability in complex tasks--where multi-faceted rubrics, unstructured reference answers, and nuanced criteria are critical--remains understudied. In this paper, we constructed ComplexEval, a challenge benchmark designed to systematically expose and quantify Auxiliary Information Induced Biases. We systematically investigated and validated 6 previously unexplored biases across 12 basic and 3 advanced scenarios. Key findings reveal: (1) all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity; (2) notably, Large Reasoning Models (LRMs) show paradoxical vulnerability. Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals, paving the way for more general and robust evaluation models.
中文摘要:该研究通过ComplexEval基准揭示,大语言模型在复杂任务中极易受多种偏见影响,且任务越复杂偏见越显著,为提升评估模型的准确性和鲁棒性提供了关键洞见。
English Summary: The study introduces ComplexEval, a benchmark revealing that large language models are highly susceptible to biases in complex tasks, with bias severity increasing alongside task complexity.

Authors:Yewen Li, Jingtong Gao, Nan Jiang, Shuai Mao, Ruyi An, Fei Pan, Xiangyu Zhao, Bo An, Qingpeng Cai, Peng Jiang
Title: Generative Auto-Bidding in Large-Scale Competitive Auctions via Diffusion Completer-Aligner
Abstract:
Auto-bidding is central to computational advertising, achieving notable commercial success by optimizing advertisers' bids within economic constraints. Recently, large generative models show potential to revolutionize auto-bidding by generating bids that could flexibly adapt to complex, competitive environments. Among them, diffusers stand out for their ability to address sparse-reward challenges by focusing on trajectory-level accumulated rewards, as well as their explainable capability, i.e., planning a future trajectory of states and executing bids accordingly. However, diffusers struggle with generation uncertainty, particularly regarding dynamic legitimacy between adjacent states, which can lead to poor bids and further cause significant loss of ad impression opportunities when competing with other advertisers in a highly competitive auction environment. To address it, we propose a Causal auto-Bidding method based on a Diffusion completer-aligner framework, termed CBD. Firstly, we augment the diffusion training process with an extra random variable t, where the model observes t-length historical sequences with the goal of completing the remaining sequence, thereby enhancing the generated sequences' dynamic legitimacy. Then, we employ a trajectory-level return model to refine the generated trajectories, aligning more closely with advertisers' objectives. Experimental results across diverse settings demonstrate that our approach not only achieves superior performance on large-scale auto-bidding benchmarks, such as a 29.9% improvement in conversion value in the challenging sparse-reward auction setting, but also delivers significant improvements on the Kuaishou online advertising platform, including a 2.0% increase in target cost.
Auto-bidding is evolving with generative models like diffusers, which face challenges in dynamic legitimacy but are enhanced by the proposed CBD method, improving performance in competitive ad auctions.
English Summary:

Authors:Yanwen Wang, Yiyu Zhuang, Jiawei Zhang, Li Wang, Yifei Zeng, Xun Cao, Xinxin Zuo, Hao Zhu
Title: TeRA: Rethinking Text-guided Realistic 3D Avatar Generation
Abstract:
In this paper, we rethink text-to-avatar generative models by proposing TeRA, a more efficient and effective framework than the previous SDS-based models and general large 3D generative models. Our approach employs a two-stage training strategy for learning a native 3D avatar generative model. Initially, we distill a decoder to derive a structured latent space from a large human reconstruction model. Subsequently, a text-controlled latent diffusion model is trained to generate photorealistic 3D human avatars within this latent space. TeRA enhances the model performance by eliminating slow iterative optimization and enables text-based partial customization through a structured 3D human representation. Experiments have proven our approach's superiority over previous text-to-avatar generative models in subjective and objective evaluation.
中文: 本文提出TeRA框架,通过构建结构化潜空间和训练扩散模型的两阶段方法,显著提升了文本到虚拟人生成的效率和真实感,实验证明其性能优于现有模型。
English: This paper introduces TeRA, a two-stage framework that enhances text-to-avatar generation by creating a structured latent space and training a diffusion model, resulting in faster, more realistic 3D human avatars with superior performance.

Authors:Yuqing Chen, Junjie Wang, Lin Liu, Ruihang Chu, Xiaopeng Zhang, Qi Tian, Yujiu Yang
Title: O-DisCo-Edit: Object Distortion Control for Unified Realistic Video Editing
Abstract:
Diffusion models have recently advanced video editing, yet controllable editing remains challenging due to the need for precise manipulation of diverse object properties. Current methods require different control signal for diverse editing tasks, which complicates model design and demands significant training resources. To address this, we propose O-DisCo-Edit, a unified framework that incorporates a novel object distortion control (O-DisCo). This signal, based on random and adaptive noise, flexibly encapsulates a wide range of editing cues within a single representation. Paired with a "copy-form" preservation module for preserving non-edited regions, O-DisCo-Edit enables efficient, high-fidelity editing through an effective training paradigm. Extensive experiments and comprehensive human evaluations consistently demonstrate that O-DisCo-Edit surpasses both specialized and multitask state-of-the-art methods across various video editing tasks. https://cyqii.github.io/O-DisCo-Edit.github.io/
中文:O-DisCo-Edit提出了一种统一框架,结合物体扭曲控制和复制形态保护模块,能在多种视频编辑任务中实现灵活且高保真的编辑效果,在实验和人工评估中均优于现有方法。
English: O-DisCo-Edit introduces a unified framework with object distortion control and a copy-form preservation module to enable flexible, high-fidelity video editing across diverse tasks, outperforming existing methods in experiments and human evaluations.

Authors:Sheng Lin, Fangcheng Fu, Haoyang Li, Hao Ge, Xuanyu Wang, Jiawen Niu, Yaofeng Tu, Bin Cui
Title: LobRA: Multi-tenant Fine-tuning over Heterogeneous Data
Abstract:
With the breakthrough of Transformer-based pre-trained models, the demand for fine-tuning (FT) to adapt the base pre-trained models to downstream applications continues to grow, so it is essential for service providers to reduce the cost of processing FT requests. Low-rank adaption (LoRA) is a widely used FT technique that only trains small-scale adapters and keeps the base model unaltered, conveying the possibility of processing multiple FT tasks by jointly training different LoRA adapters with a shared base model. Nevertheless, through in-depth analysis, we reveal the efficiency of joint FT is dampened by two heterogeneity issues in the training data -- the sequence length variation and skewness. To tackle these issues, we develop LobRA, a brand new framework that supports processing multiple FT tasks by jointly training LoRA adapters. Two innovative designs are introduced. Firstly, LobRA deploys the FT replicas (i.e., model replicas for FT) with heterogeneous resource usages and parallel configurations, matching the diverse workloads caused by the sequence length variation. Secondly, for each training step, LobRA takes account of the sequence length skewness and dispatches the training data among the heterogeneous FT replicas to achieve workload balance. We conduct experiments to assess the performance of LobRA, validating that it significantly reduces the GPU seconds required for joint FT by 45.03%-60.67%.
The LobRA framework addresses efficiency challenges in joint fine-tuning of multiple LoRA adapters by deploying heterogeneous model replicas and workload-balanced data dispatch, reducing GPU time by 45.03%-60.67%.
English Summary:

Authors:Yi-Cheng Lin, Yu-Hua Chen, Jia-Kai Dong, Yueh-Hsuan Huang, Szu-Chi Chen, Yu-Chen Chen, Chih-Yao Chen, Yu-Jung Lin, Yu-Ling Chen, Zih-Yu Chen, I-Ning Tsai, Hsiu-Hsuan Wang, Ho-Lam Chung, Ke-Han Lu, Hung-yi Lee
Title: TAU: A Benchmark for Cultural Sound Understanding Beyond Semantics
Abstract:
Large audio-language models are advancing rapidly, yet most evaluations emphasize speech or globally sourced sounds, overlooking culturally distinctive cues. This gap raises a critical question: can current models generalize to localized, non-semantic audio that communities instantly recognize but outsiders do not? To address this, we present TAU (Taiwan Audio Understanding), a benchmark of everyday Taiwanese "soundmarks." TAU is built through a pipeline combining curated sources, human editing, and LLM-assisted question generation, producing 702 clips and 1,794 multiple-choice items that cannot be solved by transcripts alone. Experiments show that state-of-the-art LALMs, including Gemini 2.5 and Qwen2-Audio, perform far below local humans. TAU demonstrates the need for localized benchmarks to reveal cultural blind spots, guide more equitable multimodal evaluation, and ensure models serve communities beyond the global mainstream.
Chinese Summary: TAU基准测试通过评估模型对台湾本土标志性声音的识别能力,揭示了大型音频语言模型存在文化盲区,即使最先进的模型表现也远逊于当地人,凸显了建立文化包容性评估体系的必要性。
English Summary: The TAU benchmark exposes cultural blind spots in large audio-language models by testing their ability to recognize localized Taiwanese soundmarks, revealing that even state-of-the-art models perform significantly worse than local humans and highlighting the need for culturally inclusive evaluations.

Authors:Jundong Xu, Hao Fei, Yuhui Zhang, Liangming Pan, Qijun Huang, Qian Liu, Preslav Nakov, Min-Yen Kan, William Yang Wang, Mong-Li Lee, Wynne Hsu
Title: MuSLR: Multimodal Symbolic Logical Reasoning
Abstract:
Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.
中文:MuSLR基准旨在评估视觉语言模型的多模态符号逻辑推理能力,发现现有模型表现不佳,并提出了LogiCAM模块化框架,通过应用形式逻辑规则显著提升了多模态推理性能。
English: The MuSLR benchmark is introduced to evaluate multimodal symbolic logical reasoning in vision-language models, revealing their limitations and proposing LogiCAM, a modular framework that significantly enhances performance by applying formal logic rules to multimodal inputs.

Authors:Hechuan Guo, Yongle Hao, Yue Zhang, Minghui Xu, Peizhuo Lyu, Jiezhi Chen, Xiuzhen Cheng
Title: A Measurement Study of Model Context Protocol
Abstract:
The Model Context Protocol (MCP) has been proposed as a unifying standard for connecting large language models (LLMs) with external tools and resources, promising the same role for AI integration that HTTP and USB played for the Web and peripherals. Yet, despite rapid adoption and hype, its trajectory remains uncertain. Are MCP marketplaces truly growing, or merely inflated by placeholders and abandoned prototypes? Are servers secure and privacy-preserving, or do they expose users to systemic risks? And do clients converge on standardized protocols, or remain fragmented across competing designs? In this paper, we present the first large-scale empirical study of the MCP ecosystem. We design and implement MCPCrawler, a systematic measurement framework that collects and normalizes data from six major markets. Over a 14-day campaign, MCPCrawler aggregated 17,630 raw entries, of which 8,401 valid projects (8,060 servers and 341 clients) were analyzed. Our results reveal that more than half of listed projects are invalid or low-value, that servers face structural risks including dependency monocultures and uneven maintenance, and that clients exhibit a transitional phase in protocol and connection patterns. Together, these findings provide the first evidence-based view of the MCP ecosystem, its risks, and its future trajectory.
中文: 针对模型上下文协议(MCP)生态系统的首次大规模实证研究表明,超过一半的列示项目无效或价值低下,服务器面临依赖单一化和维护不均等结构性风险,客户端处于协议与连接模式的过渡阶段,为评估其风险与未来走向提供了实证依据。
English: This first large-scale empirical study of the Model Context Protocol (MCP) ecosystem reveals that over half of listed projects are invalid or low-value, servers face structural risks like dependency monocultures, and clients are in a transitional phase, providing an evidence-based view of its risks and future trajectory.

Authors:Haoming Wen, Yushi Bai, Juanzi Li, Jie Tang
Title: SIRI: Scaling Iterative Reinforcement Learning with Interleaved Compression
Abstract:
We introduce SIRI, Scaling Iterative Reinforcement Learning with Interleaved Compression, a simple yet effective RL approach for Large Reasoning Models (LRMs) that enables more efficient and accurate reasoning. Existing studies have observed repetitive thinking patterns in LRMs, and attempts to reduce them often come at the cost of performance. In this paper, we show that this trade-off can be overcome through a training regime that iteratively alternates between compressing and expanding the reasoning budget, by dynamically adjusting the maximum rollout length during training. The compression phase cuts the rollout length, forcing the model to make precise and valuable decisions within a limited context, which effectively reduces redundant tokens and increases reasoning density. The expansion phase then relaxes the length limit, providing space for the model to explore and plan in long-horizon settings. Remarkably, we find that after each compression-expansion cycle, the model's performance improves even as its output length decreases, steadily pushing it closer to the Pareto frontier in the performance-efficiency trade-off. Training on DeepSeek-R1-Distill-Qwen-1.5B, SIRI-low improves performance on AIME24 by 43.2% while reducing token usage by 46.9% after three iterations, and SIRI-high achieves the highest accuracy compared to all other methods (Figure 1). Our findings shed light on the potential of periodically oscillating the LRM's output truncation length during training to dynamically balance exploration and efficiency in reasoning, converging towards an optimal "sweet spot" between the two. Our models are publicly available.
中文: SIRI是一种通过在训练中交替压缩和扩展推理步骤的强化学习方法,使大型推理模型能够动态平衡探索与效率,从而用更少的标记实现更高的准确性。
English: SIRI is a reinforcement learning method that alternates between compressing and expanding reasoning steps during training, enabling large reasoning models to achieve higher accuracy with fewer tokens by dynamically balancing exploration and efficiency.

Authors:Yaozu Wu, Jizhou Guo, Dongyuan Li, Henry Peng Zou, Wei-Chieh Huang, Yankai Chen, Zhen Wang, Weizhi Zhang, Yangning Li, Meng Zhang, Renhe Jiang, Philip S. Yu
Title: PSG-Agent: Personality-Aware Safety Guardrail for LLM-based Agents
Abstract:
Effective guardrails are essential for safely deploying LLM-based agents in critical applications. Despite recent advances, existing guardrails suffer from two fundamental limitations: (i) they apply uniform guardrail policies to all users, ignoring that the same agent behavior can harm some users while being safe for others; (ii) they check each response in isolation, missing how risks evolve and accumulate across multiple interactions. To solve these issues, we propose PSG-Agent, a personalized and dynamic system for LLM-based agents. First, PSG-Agent creates personalized guardrails by mining the interaction history for stable traits and capturing real-time states from current queries, generating user-specific risk thresholds and protection strategies. Second, PSG-Agent implements continuous monitoring across the agent pipeline with specialized guards, including Plan Monitor, Tool Firewall, Response Guard, Memory Guardian, that track cross-turn risk accumulation and issue verifiable verdicts. Finally, we validate PSG-Agent in multiple scenarios including healthcare, finance, and daily life automation scenarios with diverse user profiles. It significantly outperform existing agent guardrails including LlamaGuard3 and AGrail, providing an executable and auditable path toward personalized safety for LLM-based agents.
中文摘要:PSG-Agent通过挖掘交互历史生成个性化防护策略,并在智能体全流程部署持续监控机制,解决了现有防护系统忽略用户差异和跨对话风险累积的问题,为关键应用中的LLM智能体提供了可执行、可审计的安全保障。
English Summary: PSG-Agent introduces personalized and dynamic safety mechanisms for LLM-based agents by creating user-specific guardrails through interaction analysis and implementing continuous monitoring across the agent pipeline to address limitations of uniform policies and isolated response checks.

Authors:Xiaomin Cao, Mohammadali Mohammadi, Hien Quoc Ngo, Hyundong Shin, Michail Matthaiou
Title: RIS-Assisted XL-MIMO for Near-Field and Far-Field Communications
Abstract:
We consider a reconfigurable intelligent surface (RIS)-assisted extremely large-scale multiple-input multiple-output (XL-MIMO) downlink system, where an XL-MIMO array serves two groups of single-antennas users, namely near-field users (NFUEs) and far-field users (FFUEs). FFUEs are subject to blockage, and their communication is facilitated through the RIS. We consider three precoding schemes at the XL-MIMO array, namely central zero-forcing (CZF), local zero-forcing (LZF) and maximum ratio transmission (MRT). Closed-form expressions for the spectral efficiency (SE) of all users are derived for MRT precoding, while statistical-form expressions are obtained for CZF and LZF processing. A heuristic visibility region (VR) selection algorithm is also introduced to help reduce the computational complexity of the precoding scheme. Furthermore, we devise a two-stage phase shifts design and power control algorithm to maximize the sum of weighted minimum SE of two groups of users with CZF, LZF and MRT precoding schemes. The simulation results indicate that, when equal priority is given to NFUEs and FFUEs, the proposed design improves the sum of the weighted minimum SE by 31.9\%, 37.8\%, and 119.2\% with CZF, LZF, and MRT, respectively, compared to the case with equal power allocation and random phase shifts design. CZF achieves the best performance, while LZF offers comparable results with lower complexity. When prioritizing NFUEs or FFUEs, LZF achieves strong performance for the prioritized group, whereas CZF ensures balanced performance between NFUEs and FFUEs.
中文摘要:本文提出了一种智能超表面辅助的XL-MIMO系统,通过设计预编码方案和优化算法,在降低计算复杂度的同时显著提升了近场与远场用户的频谱效率。
English Summary: This paper presents a reconfigurable intelligent surface-assisted XL-MIMO system serving near-field and far-field users, developing precoding schemes and optimization algorithms that significantly enhance spectral efficiency while reducing computational complexity.

Authors:Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei
Title: Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
Abstract:
Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.
中文: 提出的混合视觉思维(MoVT)范式在单一模型中统一多种推理模式,并通过AdaVaR框架实现基于上下文的动态选择,在各类视觉推理场景中均取得了稳定提升。
English: The proposed Mixture-of-Visual-Thoughts (MoVT) paradigm unifies multiple reasoning modes in a single model and adaptively selects them based on context using the AdaVaR framework, achieving consistent improvements across diverse visual reasoning scenarios.

Authors:Lovenya Jain, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan
Title: Investigating Faithfulness in Large Audio Language Models
Abstract:
Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model's decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.
中文摘要:本研究通过多种干预方法评估大型音频语言模型中思维链推理的忠实性,发现其解释过程与模型决策机制基本一致。
English Summary: The study evaluates the faithfulness of chain-of-thought reasoning in large audio-language models, finding through targeted interventions that their explanations generally align with the models' decision processes.

Authors:Cehao Yang, Xiaojun Wu, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Yuanliang Sun, Jia Li, Hui Xiong, Jian Guo
Title: GraphSearch: An Agentic Deep Searching Workflow for Graph Retrieval-Augmented Generation
Abstract:
Graph Retrieval-Augmented Generation (GraphRAG) enhances factual reasoning in LLMs by structurally modeling knowledge through graph-based representations. However, existing GraphRAG approaches face two core limitations: shallow retrieval that fails to surface all critical evidence, and inefficient utilization of pre-constructed structural graph data, which hinders effective reasoning from complex queries. To address these challenges, we propose \textsc{GraphSearch}, a novel agentic deep searching workflow with dual-channel retrieval for GraphRAG. \textsc{GraphSearch} organizes the retrieval process into a modular framework comprising six modules, enabling multi-turn interactions and iterative reasoning. Furthermore, \textsc{GraphSearch} adopts a dual-channel retrieval strategy that issues semantic queries over chunk-based text data and relational queries over structural graph data, enabling comprehensive utilization of both modalities and their complementary strengths. Experimental results across six multi-hop RAG benchmarks demonstrate that \textsc{GraphSearch} consistently improves answer accuracy and generation quality over the traditional strategy, confirming \textsc{GraphSearch} as a promising direction for advancing graph retrieval-augmented generation.
中文: GraphSearch 提出了一种具有双通道检索功能的智能深度搜索工作流程,解决了 GraphRAG 中检索浅层化和图数据利用效率低的问题,在多个基准测试中显著提升了推理准确性和生成质量。
English: GraphSearch introduces an agentic deep searching workflow with dual-channel retrieval to overcome the limitations of shallow retrieval and inefficient graph data use in GraphRAG, significantly enhancing reasoning accuracy and generation quality across benchmarks.

Authors:Xiaojun Wu, Cehao Yang, Xueyuan Lin, Chengjin Xu, Xuhui Jiang, Yuanliang Sun, Hui Xiong, Jia Li, Jian Guo
Title: Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval
Abstract:
Retrieval-Augmented Generation (RAG) and Graph-based RAG has become the important paradigm for enhancing Large Language Models (LLMs) with external knowledge. However, existing approaches face a fundamental trade-off. While graph-based methods are inherently dependent on high-quality graph structures, they face significant practical constraints: manually constructed knowledge graphs are prohibitively expensive to scale, while automatically extracted graphs from corpora are limited by the performance of the underlying LLM extractors, especially when using smaller, local-deployed models. This paper presents Think-on-Graph 3.0 (ToG-3), a novel framework that introduces Multi-Agent Context Evolution and Retrieval (MACER) mechanism to overcome these limitations. Our core innovation is the dynamic construction and refinement of a Chunk-Triplets-Community heterogeneous graph index, which pioneeringly incorporates a dual-evolution mechanism of Evolving Query and Evolving Sub-Graph for precise evidence retrieval. This approach addresses a critical limitation of prior Graph-based RAG methods, which typically construct a static graph index in a single pass without adapting to the actual query. A multi-agent system, comprising Constructor, Retriever, Reflector, and Responser agents, collaboratively engages in an iterative process of evidence retrieval, answer generation, sufficiency reflection, and, crucially, evolving query and subgraph. This dual-evolving multi-agent system allows ToG-3 to adaptively build a targeted graph index during reasoning, mitigating the inherent drawbacks of static, one-time graph construction and enabling deep, precise reasoning even with lightweight LLMs. Extensive experiments demonstrate that ToG-3 outperforms compared baselines on both deep and broad reasoning benchmarks, and ablation studies confirm the efficacy of the components of MACER framework.
中文: ToG-3通过创新的多智能体框架和查询与子图双重演化机制,解决了基于图的检索增强生成方法中静态图构建的局限性,使得轻量级语言模型也能实现精准推理。
English: ToG-3 introduces a dynamic multi-agent framework with dual-evolving query and subgraph mechanisms to overcome the limitations of static graph construction in retrieval-augmented generation, enabling precise reasoning even with lightweight language models.

Authors:Gabrijel Boduljak, Laurynas Karazija, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Title: What Happens Next? Anticipating Future Motion by Generating Point Trajectories
Abstract:
We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. We extensively evaluate our method on simulated data, demonstrate its effectiveness on downstream applications such as robotics, and show promising accuracy on real-world intuitive physics datasets. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.
中文: 本研究提出一种从单张图像预测物体运动的方法,通过生成密集轨迹网格,在精度和多样性上优于现有技术,并在机器人学和真实物理场景应用中展现出有效性。
English: This study introduces a model for forecasting object motion from a single image by generating dense trajectory grids, which outperforms existing methods in accuracy and diversity and demonstrates effectiveness in robotics and real-world physics applications.

Authors:Hsiao-Ying Huang, Yi-Cheng Lin, Hung-yi Lee
Title: MI-Fuse: Label Fusion for Unsupervised Domain Adaptation with Closed-Source Large-Audio Language Model
Abstract:
Large audio-language models (LALMs) show strong zero-shot ability on speech tasks, suggesting promise for speech emotion recognition (SER). However, SER in real-world deployments often fails under domain mismatch, where source data are unavailable and powerful LALMs are accessible only through an API. We ask: given only unlabeled target-domain audio and an API-only LALM, can a student model be adapted to outperform the LALM in the target domain? To this end, we propose MI-Fuse, a denoised label fusion framework that supplements the LALM with a source-domain trained SER classifier as an auxiliary teacher. The framework draws multiple stochastic predictions from both teachers, weights their mean distributions by mutual-information-based uncertainty, and stabilizes training with an exponential moving average teacher. Experiments across three public emotion datasets and six cross-domain transfers show consistent gains, with the student surpassing the LALM and outperforming the strongest baseline by 3.9%. This approach strengthens emotion-aware speech systems without sharing source data, enabling realistic adaptation.
中文:MI-Fuse框架通过双教师的去噪标签融合,使学生在跨领域语音情感识别中超越大型音频语言模型,无需源域数据即可实现3.9%的性能提升。
English: The MI-Fuse framework enables a student model to outperform large audio-language models in cross-domain speech emotion recognition by leveraging denoised label fusion from dual teachers, achieving 3.9% improvement without requiring source data access.

Authors:Yixin Liu, Yonghui Wu, Denghui Zhang, Lichao Sun
Title: Agentic AutoSurvey: Let LLMs Survey LLMs
Abstract:
The exponential growth of scientific literature poses unprecedented challenges for researchers attempting to synthesize knowledge across rapidly evolving fields. We present \textbf{Agentic AutoSurvey}, a multi-agent framework for automated survey generation that addresses fundamental limitations in existing approaches. Our system employs four specialized agents (Paper Search Specialist, Topic Mining \& Clustering, Academic Survey Writer, and Quality Evaluator) working in concert to generate comprehensive literature surveys with superior synthesis quality. Through experiments on six representative LLM research topics from COLM 2024 categories, we demonstrate that our multi-agent approach achieves significant improvements over existing baselines, scoring 8.18/10 compared to AutoSurvey's 4.77/10. The multi-agent architecture processes 75--443 papers per topic (847 total across six topics) while targeting high citation coverage (often $\geq$80\% on 75--100-paper sets; lower on very large sets such as RLHF) through specialized agent orchestration. Our 12-dimension evaluation captures organization, synthesis integration, and critical analysis beyond basic metrics. These findings demonstrate that multi-agent architectures represent a meaningful advancement for automated literature survey generation in rapidly evolving scientific domains.
中文: Agentic AutoSurvey是一种多智能体框架,通过协同运作四个专业智能体,在快速发展的科学领域中实现了文献综述生成质量的显著提升,展现出卓越的综合能力与覆盖范围。
English: Agentic AutoSurvey is a multi-agent framework that significantly enhances automated literature survey generation by employing specialized agents to achieve superior synthesis quality and comprehensive coverage across rapidly evolving scientific fields.

Authors:Matthieu Cervera, Francesco Paissan, Mirco Ravanelli, Cem Subakan
Title: Virtual Consistency for Audio Editing
Abstract:
Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual-consistency based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.
Chinese: 本文提出了一种基于虚拟一致性的音频编辑系统,通过改进扩散模型采样过程实现了显著加速,在保持质量的同时无需模型微调或结构调整,大幅提升了处理效率。
English: This paper introduces a virtual-consistency based audio editing system that accelerates processing by adapting diffusion model sampling, achieving significant speed improvements without quality loss while remaining model-agnostic.

Authors:Jinyuan Qu, Hongyang Li, Xingyu Chen, Shilong Liu, Yukai Shi, Tianhe Ren, Ruitao Jing, Lei Zhang
Title: SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
Abstract:
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. As 3D training data is generally not as sufficient as 2D training images, SegDINO3D is designed to fully leverage 2D representation from a pre-trained 2D detection model, including both image-level and object-level features, for improving 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in the memory while faithfully preserving the knowledge of the pre-trained 2D model. The introducing of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves the state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.
Chinese: SegDINO3D是一种创新的Transformer框架,通过融合预训练2D模型的图像和对象级特征来提升3D实例分割性能,在ScanNetV2和ScanNet200基准测试中取得了领先水平。
English: SegDINO3D is an advanced Transformer-based framework that enhances 3D instance segmentation by integrating 2D image and object-level features from pre-trained models, achieving state-of-the-art results on benchmarks like ScanNetV2 and ScanNet200.

Authors:Xuanjun Chen, Chia-Yu Hu, I-Ming Lin, Yi-Cheng Lin, I-Hsiang Chiu, You Zhang, Sung-Feng Huang, Yi-Hsuan Yang, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Title: How Does Instrumental Music Help SingFake Detection?
Abstract:
Although many models exist to detect singing voice deepfakes (SingFake), how these models operate, particularly with instrumental accompaniment, is unclear. We investigate how instrumental music affects SingFake detection from two perspectives. To investigate the behavioral effect, we test different backbones, unpaired instrumental tracks, and frequency subbands. To analyze the representational effect, we probe how fine-tuning alters encoders' speech and music capabilities. Our results show that instrumental accompaniment acts mainly as data augmentation rather than providing intrinsic cues (e.g., rhythm or harmony). Furthermore, fine-tuning increases reliance on shallow speaker features while reducing sensitivity to content, paralinguistic, and semantic information. These insights clarify how models exploit vocal versus instrumental cues and can inform the design of more interpretable and robust SingFake detection systems.
中文: 器乐伴奏在歌声深度伪造检测中主要起数据增强作用而非提供内在音乐线索,同时模型微调会增强对表面说话人特征的依赖而降低对内容与语义信息的敏感性。
English: Instrumental accompaniment in singing voice deepfake detection primarily serves as data augmentation rather than providing intrinsic musical cues, while fine-tuning shifts model focus toward shallow speaker features at the expense of content and semantic understanding.

Authors:Javier Conde, María Grandury, Tairan Fu, Carlos Arriaga, Gonzalo Martínez, Thomas Clark, Sean Trott, Clarence Gerald Green, Pedro Reviriego, Marc Brysbaert
Title: Adding LLMs to the psycholinguistic norming toolbox: A practical guide to getting the most out of human ratings
Abstract:
Word-level psycholinguistic norms lend empirical support to theories of language processing. However, obtaining such human-based measures is not always feasible or straightforward. One promising approach is to augment human norming datasets by using Large Language Models (LLMs) to predict these characteristics directly, a practice that is rapidly gaining popularity in psycholinguistics and cognitive science. However, the novelty of this approach (and the relative inscrutability of LLMs) necessitates the adoption of rigorous methodologies that guide researchers through this process, present the range of possible approaches, and clarify limitations that are not immediately apparent, but may, in some cases, render the use of LLMs impractical. In this work, we present a comprehensive methodology for estimating word characteristics with LLMs, enriched with practical advice and lessons learned from our own experience. Our approach covers both the direct use of base LLMs and the fine-tuning of models, an alternative that can yield substantial performance gains in certain scenarios. A major emphasis in the guide is the validation of LLM-generated data with human "gold standard" norms. We also present a software framework that implements our methodology and supports both commercial and open-weight models. We illustrate the proposed approach with a case study on estimating word familiarity in English. Using base models, we achieved a Spearman correlation of 0.8 with human ratings, which increased to 0.9 when employing fine-tuned models. This methodology, framework, and set of best practices aim to serve as a reference for future research on leveraging LLMs for psycholinguistic and lexical studies.
中文摘要:本研究提出了一套利用大语言模型预测词汇心理语言学特征的综合方法及软件框架,强调需通过人类标准数据进行严格验证,并通过案例研究表明微调模型与人类评分相关性可达0.9,为相关研究提供了重要参考。
English Summary: This study introduces a comprehensive methodology and software framework for using Large Language Models (LLMs) to predict word-level psycholinguistic characteristics, emphasizing rigorous validation against human norms and demonstrating through a case study that fine-tuned models can achieve correlations up to 0.9 with human ratings.

Authors:Zhipeng Bian, Jieming Zhu, Xuyang Xie, Quanyu Dai, Zhou Zhao, Zhenhua Dong
Title: MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation
Abstract:
The rapid advancement of generative AI technologies is driving the integration of diverse AI-powered services into smartphones, transforming how users interact with their devices. To simplify access to predefined AI services, this paper introduces MIRA, a pioneering framework for task instruction recommendation that enables intuitive one-touch AI tasking on smartphones. With MIRA, users can long-press on images or text objects to receive contextually relevant instruction recommendations for executing AI tasks. Our work introduces three key innovations: 1) A multimodal large language model (MLLM)-based recommendation pipeline with structured reasoning to extract key entities, infer user intent, and generate precise instructions; 2) A template-augmented reasoning mechanism that integrates high-level reasoning templates, enhancing task inference accuracy; 3) A prefix-tree-based constrained decoding strategy that restricts outputs to predefined instruction candidates, ensuring coherent and intent-aligned suggestions. Through evaluation using a real-world annotated datasets and a user study, MIRA has demonstrated substantial improvements in the accuracy of instruction recommendation. The encouraging results highlight MIRA's potential to revolutionize the way users engage with AI services on their smartphones, offering a more seamless and efficient experience.
中文: 本文提出MIRA框架,通过多模态分析和结构化推理为智能手机提供情境相关的指令推荐,实现一键式AI任务操作,显著提升了用户与AI服务的交互体验。
English: This paper introduces MIRA, a framework that enables intuitive one-touch AI tasking on smartphones by recommending contextually relevant instructions through multimodal analysis and structured reasoning, significantly improving user interaction with AI services.

Authors:Andrea Piergentili, Beatrice Savoldi, Matteo Negri, Luisa Bentivogli
Title: Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs
Abstract:
Gender-neutral rewriting (GNR) aims to reformulate text to eliminate unnecessary gender specifications while preserving meaning, a particularly challenging task in grammatical-gender languages like Italian. In this work, we conduct the first systematic evaluation of state-of-the-art large language models (LLMs) for Italian GNR, introducing a two-dimensional framework that measures both neutrality and semantic fidelity to the input. We compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance. Our findings show that open-weight LLMs outperform the only existing model dedicated to GNR in Italian, whereas our fine-tuned models match or exceed the best open-weight LLM's performance at a fraction of its size. Finally, we discuss the trade-off between optimizing the training data for neutrality and meaning preservation.
中文: 本研究提出了意大利语性别中性改写的首个系统评估框架,证明经过微调的开源模型能以更小规模超越专用模型性能,同时优化了中立性与语义保真度之间的平衡。
English: This study introduces a systematic evaluation framework for gender-neutral rewriting in Italian, demonstrating that fine-tuned open-weight language models can outperform dedicated models while balancing neutrality and semantic preservation.

Authors:Yihao Guo, Haoming Zhu, Minghui Xu, Xiuzhen Cheng, Bin Xiao
Title: xRWA: A Cross-Chain Framework for Interoperability of Real-World Assets
Abstract:
Real-World Assets (RWAs) have recently attracted increasing attention as a means of bridging traditional financial instruments with decentralized infrastructures. By representing assets such as bonds, commodities, and real estate on blockchains, RWAs can enhance liquidity, broaden accessibility, and extend the scope of decentralized finance. Industry forecasts further suggest rapid growth of tokenized RWAs in the coming years, underscoring their potential role in the evolution of digital financial markets. However, when deployed across multiple blockchains, RWAs face challenges such as repeated authentication on different chains and inefficiency caused by multi-step settlement protocols. To address these issues, we present a cross-chain framework for RWAs that emphasizes identity management, authentication, and interaction. The framework integrates Decentralized Identifiers and Verifiable Credentials with customized attributes to support decentralized identification, and incorporates an authentication protocol based on Simplified Payment Verification to avoid redundant verification across chains. Furthermore, we design a cross-chain channel that enables the settlement of RWAs without requiring channel closure, thereby improving operational efficiency. We implement the framework and evaluate it through simulations, which confirm its feasibility and demonstrate improvements in efficiency for RWAs in cross-chain settings.
中文: 本文提出了一种现实世界资产(RWA)的跨链框架,通过集成去中心化身份管理和认证协议,消除重复验证并优化多链结算,从而提升运行效率。
English: This paper introduces a cross-chain framework for Real-World Assets (RWAs) that enhances efficiency by integrating decentralized identity management and authentication protocols to eliminate redundant verifications and streamline multi-chain settlements.

Authors:Tairan Fu, David Campo-Nazareno, Javier Coronado-Blázquez, Javier Conde, Pedro Reviriego, Fabrizio Lombardi
Title: Stochastic Streets: A Walk Through Random LLM Address Generation in four European Cities
Abstract:
Large Language Models (LLMs) are capable of solving complex math problems or answer difficult questions on almost any topic, but can they generate random street addresses for European cities?
中文: 大型语言模型能够解决各种复杂问题,但其生成欧洲城市随机街道地址的能力尚不明确。
English: Large Language Models can solve complex problems across various topics, but their ability to generate random street addresses for European cities remains uncertain.

Authors:Jiayun Wang, Yousuf Aborahama, Arya Khokhar, Yang Zhang, Chuwei Wang, Karteekeya Sastry, Julius Berner, Yilin Luo, Boris Bonev, Zongyi Li, Kamyar Azizzadenesheli, Lihong V. Wang, Anima Anandkumar
Title: Accelerating 3D Photoacoustic Computed Tomography with End-to-End Physics-Aware Neural Operators
Abstract:
Photoacoustic computed tomography (PACT) combines optical contrast with ultrasonic resolution, achieving deep-tissue imaging beyond the optical diffusion limit. While three-dimensional PACT systems enable high-resolution volumetric imaging for applications spanning transcranial to breast imaging, current implementations require dense transducer arrays and prolonged acquisition times, limiting clinical translation. We introduce Pano (PACT imaging neural operator), an end-to-end physics-aware model that directly learns the inverse acoustic mapping from sensor measurements to volumetric reconstructions. Unlike existing approaches (e.g. universal back-projection algorithm), Pano learns both physics and data priors while also being agnostic to the input data resolution. Pano employs spherical discrete-continuous convolutions to preserve hemispherical sensor geometry, incorporates Helmholtz equation constraints to ensure physical consistency and operates resolutionindependently across varying sensor configurations. We demonstrate the robustness and efficiency of Pano in reconstructing high-quality images from both simulated and real experimental data, achieving consistent performance even with significantly reduced transducer counts and limited-angle acquisition configurations. The framework maintains reconstruction fidelity across diverse sparse sampling patterns while enabling real-time volumetric imaging capabilities. This advancement establishes a practical pathway for making 3D PACT more accessible and feasible for both preclinical research and clinical applications, substantially reducing hardware requirements without compromising image reconstruction quality.
中文: Pano是一种端到端的物理感知神经算子,能够从稀疏传感器数据重建高质量的三维光声图像,在显著降低硬件需求的同时保持成像保真度,实现实时体积成像。
English: Pano is an end-to-end physics-aware neural operator that reconstructs high-quality 3D photoacoustic images from sparse sensor data, enabling real-time volumetric imaging with reduced hardware requirements while maintaining fidelity.

Authors:Chao Huang, Fengran Mo, Yufeng Chen, Changhao Guan, Zhenrui Yue, Xinyu Wang, Jinan Xu, Kaiyu Huang
Title: Boosting Data Utilization for Multilingual Dense Retrieval
Abstract:
Multilingual dense retrieval aims to retrieve relevant documents across different languages based on a unified retriever model. The challenge lies in aligning representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness highly relies on the quality of the negative sample and the efficacy of mini-batch data. Different from the existing studies that focus on developing sophisticated model architecture, we propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. The extensive experimental results on a multilingual retrieval benchmark, MIRACL, with 16 languages demonstrate the effectiveness of our method by outperforming several existing strong baselines.
Chinese: 本研究提出了一种通过获取高质量困难负样本和有效小批量数据来提升多语言稠密检索中数据利用率的方法,在涵盖16种语言的MIRACL基准测试中显著超越了现有强基线模型。
English: This study introduces a method to enhance multilingual dense retrieval by improving data utilization through high-quality hard negative samples and effective mini-batch data, achieving superior performance over strong baselines across 16 languages in the MIRACL benchmark.

Authors:Haochen Huang, Shuzhang Zhong, Zhe Zhang, Shuangchen Li, Dimin Niu, Hongzhong Zheng, Runsheng Wang, Meng Li
Title: HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing
Abstract:
Large Language Models (LLMs) with Mixture-of-Expert (MoE) architectures achieve superior model performance with reduced computation costs, but at the cost of high memory capacity and bandwidth requirements. Near-Memory Processing (NMP) accelerators that stack memory directly on the compute through hybrid bonding have demonstrated high bandwidth with high energy efficiency, becoming a promising architecture for MoE models. However, as NMP accelerators comprise distributed memory and computation, how to map the MoE computation directly determines the LLM inference efficiency. Existing parallel mapping strategies, including Tensor Parallelism (TP) and Expert Parallelism (EP), suffer from either high communication costs or unbalanced computation utilization, leading to inferior efficiency. The dynamic routing mechanism of MoE LLMs further aggravates the efficiency challenges. Therefore, in this paper, we propose HD-MoE to automatically optimize the MoE parallel computation across an NMP accelerator. HD-MoE features an offline automatic hybrid parallel mapping algorithm and an online dynamic scheduling strategy to reduce the communication costs while maximizing the computation utilization. With extensive experimental results, we demonstrate that HD-MoE achieves a speedup ranging from 1.1x to 1.8x over TP, 1.1x to 1.5x over EP, and 1.0x to 1.4x over the baseline Hybrid TP-EP with Compute-Balanced parallelism strategies.
中文:HD-MoE通过混合映射和动态调度优化NMP加速器上的MoE并行计算,在降低通信成本并最大化计算利用率的同时,相比现有策略实现了显著的性能提升。
English: HD-MoE optimizes MoE parallel computation on NMP accelerators through hybrid mapping and dynamic scheduling, achieving significant speedups over existing strategies by reducing communication costs and maximizing computational utilization.

Authors:Jin Zhao, Pengfei Wang, Shuangmin Chen, Jiong Guo, Shiqing Xin, Changhe Tu, Wenping Wang
Title: Toward Precise Curve Offsetting Constrained to Parametric Surfaces
Abstract:
Computing offsets of curves on parametric surfaces is a fundamental yet challenging operation in computer aided design and manufacturing. Traditional analytical approaches suffer from time-consuming geodesic distance queries and complex self intersection handling, while discrete methods often struggle with precision. In this paper, we propose a totally different algorithm paradigm. Our key insight is that by representing the source curve as a sequence of line segment primitives, the Voronoi decomposition constrained to the parametric surface enables localized offset computation. Specifically, the offsetting process can be efficiently traced by independently visiting the corresponding Voronoi cells. To address the challenge of computing the Voronoi decomposition on parametric surfaces, we introduce two key techniques. First, we employ intrinsic triangulation in the parameter space to accurately capture geodesic distances. Second, instead of directly computing the surface-constrained Voronoi decomposition, we decompose the triangulated parameter plane using a series of plane cutting operations. Experimental results demonstrate that our algorithm achieves superior accuracy and runtime performance compared to existing methods. We also present several practical applications enabled by our approach.
中文: 本文提出一种在参数曲面上计算曲线偏移的新算法,通过将曲线表示为线段并利用内在三角剖分和平面切割的Voronoi分解来高效追踪局部偏移,相比现有方法具有更高的精度和速度。
English: This paper introduces a novel algorithm for computing curve offsets on parametric surfaces by representing curves as line segments and leveraging Voronoi decomposition with intrinsic triangulation and plane cutting to efficiently trace localized offsets, achieving higher accuracy and speed than existing methods.

Authors:Pengfei Wang, Yuexin Yang, Shuangmin Chen, Shiqing Xin, Changhe Tu, Wenping Wang
Title: Swept Volume Computation with Enhanced Geometric Detail Preservation
Abstract:
Swept volume computation, the determination of regions occupied by moving objects, is essential in graphics, robotics, and manufacturing. Existing approaches either explicitly track surfaces, suffering from robustness issues under complex interactions, or employ implicit representations that trade off geometric fidelity and face optimization difficulties. We propose a novel inversion of motion perspective: rather than tracking object motion, we fix the object and trace spatial points backward in time, reducing complex trajectories to efficiently linearizable point motions. Based on this, we introduce a multi field tetrahedral framework that maintains multiple distance fileds per element, preserving fine geometric details at trajectory intersections where single field methods fail. Our method robustly computes swept volumes for diverse motions, including translations and screw motions, and enables practical applications in path planning and collision detection.
中文摘要:本文提出了一种新颖的扫描体计算方法,通过反转运动视角并采用多场四面体框架,能够稳健处理复杂运动并保持几何细节,适用于路径规划和碰撞检测等实际应用。
English Summary: This paper introduces a novel method for swept volume computation by inverting motion perspective and using a multi-field tetrahedral framework, which robustly handles complex motions while preserving geometric details for applications in path planning and collision detection.

Authors:Biwei Yan, Yue Zhang, Minghui Xu, Runyu Pan, Jinku Li, Xiuzhen Cheng
Title: What You Code Is What We Prove: Translating BLE App Logic into Formal Models with LLMs for Vulnerability Detection
Abstract:
The application layer of Bluetooth Low Energy (BLE) is a growing source of security vulnerabilities, as developers often neglect to implement critical protections such as encryption, authentication, and freshness. While formal verification offers a principled way to check these properties, the manual effort of constructing formal models makes it impractical for large-scale analysis. This paper introduces a key insight: BLE application security analysis can be reframed as a semantic translation problem, i.e., from real-world code to formal models. We leverage large language models (LLMs) not to directly detect vulnerabilities, but to serve as translators that convert BLE-specific code into process models verifiable by tools like ProVerif. We implement this idea in VerifiaBLE, a system that combines static analysis, prompt-guided LLM translation, and symbolic verification to check three core security features: encryption, randomness, and authentication. Applied to 1,050 Android BLE apps, VerifiaBLE uncovers systemic weaknesses: only 10.2\% of apps implement all three protections, while 53.9\% omit them entirely. Our work demonstrates that using LLMs as structured translators can lower the barrier to formal methods, unlocking scalable verification across security-critical domains.
中文: 该研究将BLE应用安全分析重构为语义翻译问题,利用大语言模型将代码转换为可验证模型,并通过VerifiaBLE系统发现大多数安卓BLE应用缺失关键安全防护。
English: This paper reframes BLE application security analysis as a semantic translation task, using LLMs to convert code into verifiable models and revealing through VerifiaBLE that most Android BLE apps lack essential protections.

Authors:Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, pei Xiaobing, Jing Wang
Title: Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling
Abstract:
Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier of execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activations-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ the token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate our superior cross-model transferability, achieving 49.6% attack success rate (ASR) across five mainstream LLMs and 34.6% improvement over human-crafted prompts, and maintaining 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.
中文: 本文提出了一种基于激活引导的黑盒提示注入攻击框架,通过能量模型和MCMC采样实现了卓越的跨模型迁移能力,在多种大语言模型和场景中均展现出较高的攻击成功率。
English: This paper introduces an activations-guided black-box prompt injection attack framework that uses an Energy-based Model and MCMC sampling to achieve superior cross-model transferability, demonstrating high attack success rates across multiple LLMs and scenarios.

Authors:Huu Hung Nguyen, Anh Tuan Nguyen, Thanh Le-Cong, Yikun Li, Han Wei Ang, Yide Yin, Frank Liauw, Shar Lwin Khin, Ouh Eng Lieh, Ting Zhang, David Lo
Title: PatchSeeker: Mapping NVD Records to their Vulnerability-fixing Commits with LLM Generated Commits and Embeddings
Abstract:
Software vulnerabilities pose serious risks to modern software ecosystems. While the National Vulnerability Database (NVD) is the authoritative source for cataloging these vulnerabilities, it often lacks explicit links to the corresponding Vulnerability-Fixing Commits (VFCs). VFCs encode precise code changes, enabling vulnerability localization, patch analysis, and dataset construction. Automatically mapping NVD records to their true VFCs is therefore critical. Existing approaches have limitations as they rely on sparse, often noisy commit messages and fail to capture the deep semantics in the vulnerability descriptions. To address this gap, we introduce PatchSeeker, a novel method that leverages large language models to create rich semantic links between vulnerability descriptions and their VFCs. PatchSeeker generates embeddings from NVD descriptions and enhances commit messages by synthesizing detailed summaries for those that are short or uninformative. These generated messages act as a semantic bridge, effectively closing the information gap between natural language reports and low-level code changes. Our approach PatchSeeker achieves 59.3% higher MRR and 27.9% higher Recall@10 than the best-performing baseline, Prospector, on the benchmark dataset. The extended evaluation on recent CVEs further confirms PatchSeeker's effectiveness. Ablation study shows that both the commit message generation method and the selection of backbone LLMs make a positive contribution to PatchSeeker. We also discuss limitations and open challenges to guide future work.
中文:PatchSeeker提出了一种利用大型语言模型的新方法,有效弥合了NVD漏洞描述与漏洞修复提交之间的语义鸿沟,在准确率和召回率指标上显著优于现有方法。
English: PatchSeeker introduces a novel approach using large language models to bridge the semantic gap between NVD vulnerability descriptions and vulnerability-fixing commits, significantly outperforming existing methods in accuracy and recall metrics.

Authors:Kun Li, Cheng Wang, Minghui Xu, Yue Zhang, Xiuzhen Cheng
Title: Dataset Ownership in the Era of Large Language Models
Abstract:
As datasets become critical assets in modern machine learning systems, ensuring robust copyright protection has emerged as an urgent challenge. Traditional legal mechanisms often fail to address the technical complexities of digital data replication and unauthorized use, particularly in opaque or decentralized environments. This survey provides a comprehensive review of technical approaches for dataset copyright protection, systematically categorizing them into three main classes: non-intrusive methods, which detect unauthorized use without modifying data; minimally-intrusive methods, which embed lightweight, reversible changes to enable ownership verification; and maximally-intrusive methods, which apply aggressive data alterations, such as reversible adversarial examples, to enforce usage restrictions. We synthesize key techniques, analyze their strengths and limitations, and highlight open research challenges. This work offers an organized perspective on the current landscape and suggests future directions for developing unified, scalable, and ethically sound solutions to protect datasets in increasingly complex machine learning ecosystems.
中文摘要:本文综述了数据集版权保护的技术方法,将其分为非侵入式、轻度侵入式和重度侵入式三类,系统分析了各类方法的优劣,并指出了未来开发统一、可扩展且符合伦理的解决方案的研究方向。
English Summary: This survey systematically reviews technical approaches for dataset copyright protection by categorizing them into non-intrusive, minimally-intrusive, and maximally-intrusive methods, while analyzing their capabilities and identifying future research directions for ethical and scalable solutions.

Authors:Xiangyu Liu, Haodi Lei, Yi Liu, Yang Liu, Wei Hu
Title: ProtSAE: Disentangling and Interpreting Protein Language Models via Semantically-Guided Sparse Autoencoders
Abstract:
Sparse Autoencoder (SAE) has emerged as a powerful tool for mechanistic interpretability of large language models. Recent works apply SAE to protein language models (PLMs), aiming to extract and analyze biologically meaningful features from their latent spaces. However, SAE suffers from semantic entanglement, where individual neurons often mix multiple nonlinear concepts, making it difficult to reliably interpret or manipulate model behaviors. In this paper, we propose a semantically-guided SAE, called ProtSAE. Unlike existing SAE which requires annotation datasets to filter and interpret activations, we guide semantic disentanglement during training using both annotation datasets and domain knowledge to mitigate the effects of entangled attributes. We design interpretability experiments showing that ProtSAE learns more biologically relevant and interpretable hidden features compared to previous methods. Performance analyses further demonstrate that ProtSAE maintains high reconstruction fidelity while achieving better results in interpretable probing. We also show the potential of ProtSAE in steering PLMs for downstream generation tasks.
中文:稀疏自编码器在蛋白质语言模型中存在语义纠缠问题,而提出的ProtSAE方法利用标注数据和领域知识在训练过程中引导语义解耦,从而获得更具生物学可解释性的特征,同时保持高重建保真度和下游任务性能。
English: Sparse Autoencoder (SAE) faces semantic entanglement issues in protein language models, but the proposed ProtSAE method uses annotation data and domain knowledge to guide semantic disentanglement during training, resulting in more biologically interpretable features while maintaining high reconstruction fidelity and performance in downstream tasks.

Authors:Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Fuli Feng
Title: Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
Abstract:
Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, thereby corresponding to two core capabilities: composition and reasoning. However, with the emerging advances of T2I models in reasoning beyond composition, existing benchmarks reveal clear limitations in providing comprehensive evaluations across and within these capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent complexities of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist that specifies individual yes/no questions to assess each intended element independently to facilitate fine-grained and reliable evaluation. In statistics, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that their composition capability still remains limited in complex high-density scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Our project page: https://t2i-corebench.github.io/.
中文摘要:T2I-CoReBench是一个通过1080个复杂提示和详细检查表评估文本到图像模型组合与推理能力的综合基准,揭示了现有模型在处理高密度组合和隐含推理方面的明显不足。
English Summary: T2I-CoReBench is a comprehensive benchmark designed to evaluate text-to-image models' composition and reasoning capabilities through 1,080 complex prompts and detailed checklists, revealing current models' limitations in handling dense compositions and implicit reasoning.

Authors:Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Fuli Feng
Title: Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?
Abstract:
Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, which thus correspond to two core capabilities: composition and reasoning. Despite recent advances of T2I models in both composition and reasoning, existing benchmarks remain limited in evaluation. They not only fail to provide comprehensive coverage across and within both capabilities, but also largely restrict evaluation to low scene density and simple one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), formulating a 12-dimensional evaluation taxonomy. To increase complexity, driven by the inherent real-world complexities, we curate each prompt with higher compositional density for composition and greater reasoning intensity for reasoning. To facilitate fine-grained and reliable evaluation, we also pair each evaluation prompt with a checklist that specifies individual yes/no questions to assess each intended element independently. In statistics, our benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 28 current T2I models reveal that their composition capability still remains limited in high compositional scenarios, while the reasoning capability lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts.
中文摘要:T2I-CoReBench是一个通过1080个复杂提示和详细检查表评估文本到图像模型组合与推理能力的综合基准,揭示了现有模型在处理高密度组合和隐含推理方面的明显不足。
English Summary: T2I-CoReBench is a comprehensive benchmark designed to evaluate text-to-image models' composition and reasoning capabilities through 1,080 complex prompts and detailed checklists, revealing current models' limitations in handling dense compositions and implicit reasoning.

Authors:Sashuai Zhou, Weinan Gan, Qijiong Liu, Ke Lei, Jieming Zhu, Hai Huang, Yan Xia, Ruiming Tang, Zhenhua Dong, Zhou Zhao
Title: RecBase: Generative Foundation Model Pretraining for Zero-Shot Recommendation
Abstract:
Recent advances in LLM-based recommendation have shown promise, yet their cross-domain generalization is hindered by a fundamental mismatch between language-centric pretraining and the recommendation task. Existing methods, relying on language-level knowledge, fail to capture dynamic, item-level user interests across domains. To bridge this gap, we propose RecBase, a domain-agnostic foundational model pretrained with a recommendation-oriented objective. RecBase leverages a large-scale, heterogeneous, cross-domain corpus with unified textual representations and feature mappings to enhance cross-domain generalization. To further align item semantics across domains, we introduce a unified item tokenizer that encodes items into hierarchical concept identifiers, enabling structured representation and efficient vocabulary sharing. The model is trained using an autoregressive objective to capture complex item-level sequential patterns. On eight real-world datasets, our 1.5B-parameter model matches or surpasses the performance of LLM baselines up to 7B parameters in zero-shot and cross-domain recommendation tasks.
中文: RecBase通过引入领域无关的基础模型、统一项目分词器和自回归训练,解决了语言中心预训练与推荐任务之间的不匹配问题,在跨域场景中以更少参数实现了与大型模型相媲美的性能。
English: RecBase addresses the mismatch between language-centric pretraining and recommendation tasks by introducing a domain-agnostic foundational model with a unified item tokenizer and autoregressive training, achieving competitive performance in cross-domain scenarios with significantly fewer parameters.

Authors:Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Jian Guo, Yuanzhuo Wang
Title: JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer
Abstract:
Current evaluation paradigms for large language models (LLMs) suffer from overestimated or biased evaluation and mismatched question difficulty, leading to incomplete evaluations of LLM's knowledge and capability boundaries, which hinder LLM's effective application and optimization. To address these challenges, we propose Agent-as-Interviewer, a dynamic evaluation paradigm that employs LLM agents to conduct multi-turn interactions for evaluation. Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to call knowledge tools for wider and deeper knowledge in the dynamic multi-turn question generation, achieving more complete evaluations of the LLM's knowledge boundaries. It also leverages agents to plan query strategies for adjustment of the question difficulty levels, enhancing the difficulty control to match the actual capabilities of target LLMs. Based on this paradigm, we develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent's tool, and uses difficulty scoring as strategy guidance, thereby finally providing valuable suggestions to help targets optimize themselves. Extensive experiments validate the effectiveness of JudgeAgent's suggestions, demonstrating that Agent-as-Interviewer can accurately identify the knowledge and capability boundaries of target models. The source code is available on https://anonymous.4open.science/r/JudgeAgent.
中文:Agent-as-Interviewer范式通过AI代理进行动态多轮交互和问题难度调节,解决了当前大语言模型评估的局限性,能更准确地识别知识边界并提供优化建议。
English: The Agent-as-Interviewer paradigm addresses limitations in current LLM evaluations by using AI agents to conduct dynamic multi-turn interactions and adjust question difficulty, enabling more accurate identification of knowledge boundaries and providing optimization suggestions.

Authors:Runze Liu, Jiakang Wang, Yuling Shi, Zhihui Xie, Chenxin An, Kaiyan Zhang, Jian Zhao, Xiaodong Gu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
Title: Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
Abstract:
Reinforcement Learning (RL) has shown remarkable success in enhancing the reasoning capabilities of Large Language Models (LLMs). Process-Supervised RL (PSRL) has emerged as a more effective paradigm compared to outcome-based RL. However, existing PSRL approaches suffer from limited exploration efficiency, both in terms of branching positions and sampling. In this paper, we introduce a novel PSRL framework (AttnRL), which enables efficient exploration for reasoning models. Motivated by preliminary observations that steps exhibiting high attention scores correlate with reasoning behaviors, we propose to branch from positions with high values. Furthermore, we develop an adaptive sampling strategy that accounts for problem difficulty and historical batch size, ensuring that the whole training batch maintains non-zero advantage values. To further improve sampling efficiency, we design a one-step off-policy training pipeline for PSRL. Extensive experiments on multiple challenging mathematical reasoning benchmarks demonstrate that our method consistently outperforms prior approaches in terms of performance and sampling and training efficiency.
中文: 本文提出AttnRL这一新颖的过程监督强化学习框架,通过从高注意力位置分支探索并结合自适应采样与单步离策略训练,显著提升推理模型的效率,在多项数学推理基准测试中持续超越现有方法。
English: This paper introduces AttnRL, a novel process-supervised reinforcement learning framework that enhances reasoning models through efficient exploration by branching from high-attention positions and implementing adaptive sampling with one-step off-policy training, consistently outperforming prior methods across mathematical reasoning benchmarks.

Authors:Yunli Li, Xiaoming Shi, Xiaodan Shao, Jie Xu, Rui Zhang
Title: Flexible-Sector 6DMA Base Station: Modeling and Design
Abstract:
Six-dimensional movable antenna (6DMA) has emerged as a promising new technology for future wireless networks, which can adaptively adjust the three-dimensional (3D) positions and 3D rotations of antennas/antenna arrays for performance enhancement. This paper proposes a novel cost-effective 6DMA-based base station (BS) architecture, termed the \textit{flexible-sector} BS, which allows the deployed antennas to flexibly rotate and move along a circular track, thus enabling common sector rotation and flexible antenna allocation across sectors to adapt to the spatial user distribution efficiently. In particular, we focus on the uplink transmission in a single-cell system, where the flexible-sector BS receives independent messages from multiple users. We introduce an angular-domain user distribution model, which captures the users' spatial clustering or hot-spot distribution effectively. Assuming the zero-forcing (ZF) based receiver applied at the BS to decode multiuser signals, we derive the average sum rate achievable for the users as a function of the common rotation of sectors and the antenna allocation over them. Moreover, we develop a two-step algorithm to jointly optimize the common sector rotation and antenna allocation to maximize the average sum rate of all users. It is shown that the optimal antenna number in each sector linearly increases with the number of users in it. It is also revealed that under the most favorable user distribution, the achievable sum rate gain increases in the order of $\log_{2}(B)$ in the regime of asymptotically large number of antennas, where $B$ denotes the number of sectors. Numerically results also show that as $B$ increases, the proposed flexible-sector BS achieves higher sum rate, and it outperforms other benchmark schemes, such as the traditional fixed-sector BS as well as the BS with sector rotation or antenna allocation optimization only.
中文摘要:本文提出了一种基于六维可移动天线的灵活扇区基站架构,通过联合优化扇区公共旋转和天线分配来提升上行链路性能,相比传统方案实现了显著的和速率增益。
English Summary: The paper introduces a flexible-sector base station using 6D movable antenna technology to optimize uplink performance through joint sector rotation and antenna allocation, demonstrating significant sum rate improvements over traditional systems.

Authors:Nicola Messina, Rosario Leonardi, Luca Ciampi, Fabio Carrara, Giovanni Maria Farinella, Fabrizio Falchi, Antonino Furnari
Title: Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations
Abstract:
Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations -- natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., "I am pouring vegetables from the chopping board to the pan"). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.
中文: 本文提出一种利用自然语言叙述作为训练指导的弱监督方法,用于分割第一人称视角视频中手持物体,无需像素级标注即可达到全监督方法50%以上的性能。
English: This paper introduces a weakly supervised method for segmenting in-hand objects in egocentric videos using natural language narrations as training guidance, achieving over 50% of fully supervised performance without pixel-level annotations.

Authors:Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Weikai Li, Wei Wang, Fabien Scalzo, Yizhou Sun
Title: A Multimodal LLM Approach for Visual Question Answering on Multiparametric 3D Brain MRI
Abstract:
We introduce mpLLM, a prompt-conditioned hierarchical mixture-of-experts (MoE) architecture for visual question answering over multi-parametric 3D brain MRI (mpMRI). mpLLM routes across modality-level and token-level projection experts to fuse multiple interrelated 3D modalities, enabling efficient training without image-report pretraining. To address limited image-text paired supervision, mpLLM integrates a synthetic visual question answering (VQA) protocol that generates medically relevant VQA from segmentation annotations, and we collaborate with medical experts for clinical validation. mpLLM outperforms strong medical VLM baselines by 5.3% on average across multiple mpMRI datasets. Our study features three main contributions: (1) the first clinically validated VQA dataset for 3D brain mpMRI, (2) a novel multimodal LLM that handles multiple interrelated 3D modalities, and (3) strong empirical results that demonstrate the medical utility of our methodology. Ablations highlight the importance of modality-level and token-level experts and prompt-conditioned routing.
中文总结:mpLLM是一种用于三维脑部核磁共振视觉问答的提示条件分层专家混合架构,通过模态级和令牌级专家融合与合成VQA生成,在多个数据集上平均性能超越基线5.3%,并具备临床验证优势。
English Summary: mpLLM is a prompt-conditioned hierarchical MoE architecture for 3D brain MRI visual question answering that integrates modality-level and token-level experts with synthetic VQA generation, achieving 5.3% performance improvement over baselines while providing clinical validation.

Authors:Tianlang Chen, Minkai Xu, Jure Leskovec, Stefano Ermon
Title: RFG: Test-Time Scaling for Diffusion Large Language Model Reasoning with Reward-Free Guidance
Abstract:
Diffusion large language models (dLLMs) have shown great potential in large-scale language modeling, and there is an increasing interest in further improving the capacity to solve complex problems by guiding the reasoning process step by step. Common practice for autoregressive language models typically learns a process reward model with dense annotation for each intermediate step. However, this is challenging for dLLMs where the generation is in an any-order fashion and intermediate states are partially masked sentences. To this end, in this paper, we propose reward-free guidance (RFG), a principled method for guiding the reasoning trajectory of dLLMs without explicit process reward. The key idea of RFG is to parameterize the process reward by log-likelihood ratios of the enhanced and reference dLLMs, where the enhanced model can be easily obtained by any off-the-shelf dLLM that has been post-trained with reinforcement learning (RL) or supervised fine-tuning (SFT). We provide theoretical justification that RFG induces the reward-guided sampling distribution with no additional reward. We conduct comprehensive experiments on four challenging mathematical reasoning and code generation benchmarks using a diverse suite of dLLMs enhanced with various post-training methods. RFG consistently yields significant improvements across all tasks and model types, achieving accuracy gains of up to 9.2%. These findings establish RFG as a general training-free framework that scales test-time reasoning without reliance on external reward models.
Chinese Summary: 本文提出无奖励引导(RFG)方法,通过增强型与参考扩散大语言模型的对数似然比参数化过程奖励,无需外部奖励模型即可提升模型推理能力,在多项任务中实现最高9.2%的准确率提升。
English Summary: The paper introduces reward-free guidance (RFG), a training-free method that enhances diffusion large language models' reasoning by parameterizing process rewards through log-likelihood ratios, achieving up to 9.2% accuracy gains without external reward models.

Authors:Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang
Title: Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models
Abstract:
In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
中文: 本研究提出一种将预训练视觉编码器对齐为潜在扩散模型分词器的方法,通过三阶段对齐策略利用语义丰富性提升图像生成效果并加速模型收敛。
English: This study introduces a method to align pretrained visual encoders as tokenizers for latent diffusion models, enhancing image generation by leveraging semantic richness and accelerating convergence through a three-stage alignment strategy.

Authors:Jiayu Liu, Zhenya Huang, Anya Sims, Enhong Chen, Yee Whye Teh, Ning Miao
Title: MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts
Abstract:
The current paradigm for reasoning in large language models (LLMs) involves models "thinking out loud" via a sequence of tokens, known as chain-of-thought (CoT). This approach, while effective, has several significant drawbacks. Firstly, inference requires autoregressive generation of often thousands of CoT tokens, which is slow and computationally expensive. Secondly, it constrains reasoning to the discrete space of tokens, creating an information bottleneck across reasoning steps. Thirdly, it fundamentally entangles reasoning with token generation, forcing LLMs to "think while speaking," which causes potentially short-sighted reasoning. In light of these limitations, we re-imagine reasoning in LLMs and present a new paradigm: MARCOS. In our approach, rather than autoregressively generating tokens, we model reasoning as a hidden Markov chain of continuous, high-dimensional "thoughts". Each reasoning step involves a transition of the internal thoughts, where explicit reasoning steps (which may consist of hundreds of tokens) serve as observable variables, which are windows to peek into the implicit thoughts. Since this latent process is incompatible with the standard supervised learning, we further propose a two-phase variational training scheme. Our experiments on three benchmarks demonstrate that MARCOS outperforms existing continuous reasoning methods and, for the first time, achieves performance comparable to token-based CoT, even surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference. Beyond this, MARCOS offers additional advantages, such as step-level instead of token-level control over randomness, opening significant opportunities for reinforcement learning and reasoning in LLMs.
中文: MARCOS范式将大语言模型的推理重新构想为连续思维的隐马尔可夫链,在实现与基于标记的思维链相当性能的同时,提供了更快的推理速度和额外的控制优势。
English: The MARCOS paradigm reimagines reasoning in LLMs by modeling it as a hidden Markov chain of continuous thoughts, achieving performance comparable to token-based chain-of-thought while offering faster inference and additional control benefits.

Authors:Zeyu Xie, Chenxing Li, Xuenan Xu, Mengyue Wu, Wenfu Wang, Ruibo Fu, Meng Yu, Dong Yu, Yuexian Zou
Title: When Audio Generators Become Good Listeners: Generative Features for Understanding Tasks
Abstract:
This work pioneers the utilization of generative features in enhancing audio understanding. Unlike conventional discriminative features that directly optimize posterior and thus emphasize semantic abstraction while losing fine grained details, audio generation models inherently encode both spatiotemporal perception (capturing local acoustic texture across time and frequency) and semantic prior (knowing what to generate). It motivates us to explore the bridge of these complementary strengths. We provide a systematic investigation of their differences and complementary relationships, and ultimately propose an effective fusion strategy. Experiments across multiple tasks, including sound event classification, tagging, and particularly the fine grained task of audio captioning, demonstrate consistent performance gains. Beyond empirical improvements, this work more importantly introduces a new perspective on audio representation learning, highlighting that generative discriminative complementarity can provide both detailed perception and semantic awareness for audio understanding.
中文摘要:本研究开创性地利用生成特征增强音频理解,通过融合策略有效结合了生成特征的时空感知与语义先验优势,在多项音频任务中实现性能提升,为音频表征学习提供了新视角。
English Summary: This study introduces a novel approach to audio understanding by leveraging generative features, which capture both fine-grained acoustic details and semantic context, and demonstrates their effectiveness through a fusion strategy that consistently improves performance across various audio tasks.

Authors:Xuenan Xu, Jiahao Mei, Zihao Zheng, Ye Tao, Zeyu Xie, Yaoyun Zhang, Haohe Liu, Yuning Wu, Ming Yan, Wen Wu, Chao Zhang, Mengyue Wu
Title: UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities
Abstract:
Audio generation, including speech, music and sound effects, has advanced rapidly in recent years. These tasks can be divided into two categories: time-aligned (TA) tasks, where each input unit corresponds to a specific segment of the output audio (e.g., phonemes aligned with frames in speech synthesis); and non-time-aligned (NTA) tasks, where such alignment is not available. Since modeling paradigms for the two types are typically different, research on different audio generation tasks has traditionally followed separate trajectories. However, audio is not inherently divided into such categories, making a unified model a natural and necessary goal for general audio generation. Previous unified audio generation works have adopted autoregressive architectures, while unified non-autoregressive approaches remain largely unexplored. In this work, we propose UniFlow-Audio, a universal audio generation framework based on flow matching. We propose a dual-fusion mechanism that temporally aligns audio latents with TA features and integrates NTA features via cross-attention in each model block. Task-balanced data sampling is employed to maintain strong performance across both TA and NTA tasks. UniFlow-Audio supports omni-modalities, including text, audio, and video. By leveraging the advantage of multi-task learning and the generative modeling capabilities of flow matching, UniFlow-Audio achieves strong results across 7 tasks using fewer than 8K hours of public training data and under 1B trainable parameters. Even the small variant with only ~200M trainable parameters shows competitive performance, highlighting UniFlow-Audio as a potential non-auto-regressive foundation model for audio generation. Code and models will be available at https://wsntxxn.github.io/uniflow_audio.
中文:UniFlow-Audio是一种基于流匹配的统一音频生成框架,通过双重融合机制整合时间对齐和非时间对齐任务,并以少量数据和参数在多模态任务中实现优异性能。
English: UniFlow-Audio is a unified non-autoregressive framework for audio generation that integrates time-aligned and non-time-aligned tasks through a dual-fusion mechanism and achieves strong performance across multiple modalities with minimal data and parameters.

Authors:Guancheng Wan, Lucheng Fu, Haoxin Liu, Yiqiao Jin, Hui Yi Leong, Eric Hanchen Jiang, Hejia Geng, Jinhe Bi, Yunpu Ma, Xiangru Tang, B. Aditya Prakash, Yizhou Sun, Wei Wang
Title: Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE
Abstract:
The performance of Large Language Models (LLMs) hinges on carefully engineered prompts. However, prevailing prompt optimization methods, ranging from heuristic edits and reinforcement learning to evolutionary search, primarily target point-wise accuracy. They seldom enforce paraphrase invariance or searching stability, and therefore cannot remedy this brittleness in practice. Automated prompt search remains brittle: small, semantically preserving paraphrases often cause large performance swings. We identify this brittleness as the textual sharpness of the prompt landscape. In this work, we provide the first formal treatment of textual sharpness in the discrete, semantic space of prompts, together with an operational robustness criterion over a semantic neighborhood; the design is black-box or API-only, requiring no gradients to update the model's parameters. Then we introduce TARE (Textual Sharpness-Aware Evolving), a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases and an outer, robust selection that prefers candidates whose neighborhoods remain strong. We further propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity. Diverse tasks evaluate our methods, whose design for minimizing textual sharpness gap leads to prompts that preserve accuracy under paraphrasing, outperforming accuracy-only prompt search while remaining computationally practical.
中文摘要:本研究提出了TARE和ATARE两种无需导数的框架,旨在减少大型语言模型中提示的文本尖锐性,确保在保持高精度的同时增强对语义改写的鲁棒性,并在多种任务中表现出色。
English Summary: The study introduces TARE and ATARE, derivative-free frameworks designed to minimize the textual sharpness of prompts in Large Language Models, ensuring robustness against paraphrasing while maintaining high accuracy across diverse tasks.

Authors:Zijie Meng, Jin Hao, Xiwei Dai, Yang Feng, Jiaxiang Liu, Bin Feng, Huikai Wu, Xiaotang Gai, Hengchuan Zhu, Tianxiang Hu, Yangyang Wu, Hongxia Xu, Jin Li, Jun Xiao, Xiaoqiang Liu, Joey Tianyi Zhou, Fudong Zhu, Zhihe Zhao, Lunguo Xia, Bing Fang, Jimeng Sun, Jian Wu, Zuozhu Liu
Title: DentVLM: A Multimodal Vision-Language Model for Comprehensive Dental Diagnosis and Enhanced Clinical Practice
Abstract:
Diagnosing and managing oral diseases necessitate advanced visual interpretation across diverse imaging modalities and integrated information synthesis. While current AI models excel at isolated tasks, they often fall short in addressing the complex, multimodal requirements of comprehensive clinical dental practice. Here we introduce DentVLM, a multimodal vision-language model engineered for expert-level oral disease diagnosis. DentVLM was developed using a comprehensive, large-scale, bilingual dataset of 110,447 images and 2.46 million visual question-answering (VQA) pairs. The model is capable of interpreting seven 2D oral imaging modalities across 36 diagnostic tasks, significantly outperforming leading proprietary and open-source models by 19.6% higher accuracy for oral diseases and 27.9% for malocclusions. In a clinical study involving 25 dentists, evaluating 1,946 patients and encompassing 3,105 QA pairs, DentVLM surpassed the diagnostic performance of 13 junior dentists on 21 of 36 tasks and exceeded that of 12 senior dentists on 12 of 36 tasks. When integrated into a collaborative workflow, DentVLM elevated junior dentists' performance to senior levels and reduced diagnostic time for all practitioners by 15-22%. Furthermore, DentVLM exhibited promising performance across three practical utility scenarios, including home-based dental health management, hospital-based intelligent diagnosis and multi-agent collaborative interaction. These findings establish DentVLM as a robust clinical decision support tool, poised to enhance primary dental care, mitigate provider-patient imbalances, and democratize access to specialized medical expertise within the field of dentistry.
中文: DentVLM是一种多模态视觉语言模型,能实现专家级口腔疾病诊断,其性能超越现有模型,并能提升临床效率、普及专业牙科知识。
English: DentVLM is a multimodal vision-language model that achieves expert-level oral disease diagnosis, outperforming both proprietary and open-source models while enhancing clinical efficiency and democratizing dental expertise.

Authors:Xiangchen Song, Jiaqi Sun, Zijian Li, Yujia Zheng, Kun Zhang
Title: LLM Interpretability with Identifiable Temporal-Instantaneous Representation
Abstract:
Despite Large Language Models' remarkable capabilities, understanding their internal representations remains challenging. Mechanistic interpretability tools such as sparse autoencoders (SAEs) were developed to extract interpretable features from LLMs but lack temporal dependency modeling, instantaneous relation representation, and more importantly theoretical guarantees, undermining both the theoretical foundations and the practical confidence necessary for subsequent analyses. While causal representation learning (CRL) offers theoretically grounded approaches for uncovering latent concepts, existing methods cannot scale to LLMs' rich conceptual space due to inefficient computation. To bridge the gap, we introduce an identifiable temporal causal representation learning framework specifically designed for LLMs' high-dimensional concept space, capturing both time-delayed and instantaneous causal relations. Our approach provides theoretical guarantees and demonstrates efficacy on synthetic datasets scaled to match real-world complexity. By extending SAE techniques with our temporal causal framework, we successfully discover meaningful concept relationships in LLM activations. Our findings show that modeling both temporal and instantaneous conceptual relationships advances the interpretability of LLMs.
中文: 本文提出了一种时序因果表征学习框架,通过捕捉概念间的时间延迟和瞬时因果关系来增强大型语言模型的可解释性,该框架不仅提供理论保证,还在复杂数据集上验证了有效性。
English: This paper introduces a temporal causal representation learning framework that enhances the interpretability of large language models by capturing both time-delayed and instantaneous causal relationships between concepts, providing theoretical guarantees and demonstrating effectiveness on complex datasets.

Authors:Ye Chen, Zichen Zhou, Jianyu Dou, Te Cui, Yi Yang, Yufeng Yue
Title: GLUE: Global-Local Unified Encoding for Imitation Learning via Key-Patch Tracking
Abstract:
In recent years, visual representation learning has gained widespread attention in robotic imitation learning. However, in complex Out-of-Distribution(OOD) settings characterized by clutter and occlusion, the attention of global visual representations can be diluted or interfered, leading to degraded policy performance. The invariance of local representations for task-relevant objects offers a solution. By efficiently utilizing these local representations, training and testing data can be mapped to a more similar feature space, thereby mitigating the covariate shift problem. Accordingly, we propose GLUE, a global-local unified encoding framework for imitation learning based on key-patch tracking. GLUE selects and tracks key-patches as critical local representations by employing a text-guided mechanism. It features a novel fusion framework where global patch features query local patches to distill essential information, yielding fine-grained local features with low heterogeneity relative to the global context. This fused representation steers the robot's visual attention toward task-relevant objects and preserves precise global context, which together align the training and testing distributions into a similar and task-informative feature space, ultimately enhancing the robustness of the imitation learning policy. Experiments demonstrate that GLUE achieves strong performance across diverse tasks in both simulation and real-world settings, outperforming the strongest baseline by 17.6% in simulation, 36.3% in real-world environments, and 58.3% on real-world generalization settings. The project website of GLUE is available at https://GLUE666.github.io/.
中文: 提出的GLUE框架通过关键补丁跟踪整合全局与局部视觉表征,有效解决复杂分布外场景中的性能下降问题,在仿真和现实世界测试中均实现了显著性能提升。
English: The proposed GLUE framework enhances robotic imitation learning by integrating global and local visual representations through key-patch tracking, effectively addressing performance degradation in complex out-of-distribution scenarios and achieving significant improvements in simulation and real-world tests.

Authors:Guancheng Wan, Leixin Sun, Longxu Dou, Zitong Shi, Fang Wu, Eric Hanchen Jiang, Wenke Huang, Guibin Zhang, Hejia Geng, Xiangru Tang, Zhenfei Yin, Yizhou Sun, Wei Wang
Title: Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts
Abstract:
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts (system-user, peer-peer), where agents misprioritize system-level rules in the presence of competing demands. Moreover, widely used macro-level metrics (e.g., pass@k) obscure these micro-level violations and offer little actionable guidance for remedy. In this work, we present a full-stack, three-stage framework: (1) Diagnose - Contextualized Role Adherence Score (CRAS), a query-wise, context-aware scoring metric that decomposes role adherence into four measurable dimensions; (2) Localize - attention drift analysis revealing that instruction conflicts are resolved by attention heads that are largely concentrated in middle layers; (3) Align - Surgical Alignment of Instruction Layers (SAIL), which installs LoRA only on the localized focal layers and optimizes a token-weighted DPO-style preference objective that credits tokens by their focal attentional contribution. Across standard benchmarks and MAS frameworks, our surgical approach improves instruction hierarchy compliance (e.g., +5.60% with AutoGen on MedQA) without full-model finetuning.
中文摘要:本研究提出一个三阶段框架,通过诊断、定位和对齐操作来解决多智能体系统中的指令层级遵从失效问题,采用精准干预方法在无需全模型微调的情况下提升系统合规性。
English Summary: The study introduces a three-stage framework to diagnose, localize, and align hierarchical compliance failures in LLM-powered multi-agent systems, using targeted interventions that improve instruction adherence without full-model fine-tuning.

Authors:Bing Liu, Wenqiang Yv, Xuzheng Yang, Shichang Wang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen
Title: GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions
Abstract:
AI-driven geometric problem solving is a complex vision-language task that requires accurate diagram interpretation, mathematical reasoning, and robust cross-modal grounding. A foundational yet underexplored capability for this task is the ability to identify and interpret geometric elements based on natural language queries. To address this, we introduce the task of Referring Expression Comprehension (REC) for geometric problems, which evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We present GeoRef, a benchmark dataset constructed from existing geometric problem corpora, featuring diverse, high-quality annotations and queries. Due to the lack of annotated data for this task, we generate a large-scale synthetic training dataset using a structured geometric formal language, enabling broad coverage of geometric concepts and facilitating model adaptation. We explore two fine-tuning approaches: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Our results show that GRPO significantly outperforms SFT by better aligning model behavior with task-specific rewards. Furthermore, we propose a verify-and-regenerate mechanism that detects incorrect predictions and re-infers answers using contextual reasoning history, further boosting accuracy. Notably, even state-of-the-art Multimodal Large Language Models (MLLMs) struggle with this task, underscoring the necessity of explicitly evaluating and strengthening geometric grounding as a prerequisite for robust geometric problem solving. Moreover, models trained on GeoRef demonstrate measurable improvements on downstream geometric reasoning tasks, highlighting the broader value of REC as a foundation for multimodal mathematical understanding.
中文摘要:本文提出GeoRef这一几何图表指代表达理解基准,通过组相对策略优化方法显著提升模型对几何元素的定位与推理能力,并验证其对于增强多模态数学理解的奠基性价值。
English Summary: This paper introduces GeoRef, a benchmark for Referring Expression Comprehension in geometric diagrams, and demonstrates that the Group Relative Policy Optimization method significantly enhances model performance by improving geometric element localization and reasoning accuracy.

Authors:Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Title: CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Abstract:
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose \textbf{C}ontrolling \textbf{E}ntropy via \textbf{G}radient-\textbf{P}reserving \textbf{P}olicy \textbf{O}ptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
Chinese: 强化学习优化大语言模型处理复杂推理任务,而提出的CE-GPPO算法通过保留被裁剪标记的梯度来改善探索与利用的平衡,在数学推理基准测试中始终优于现有方法。
English: Reinforcement learning optimizes large language models for complex reasoning, and the proposed CE-GPPO algorithm enhances this by preserving gradients from clipped tokens to better balance exploration and exploitation, outperforming existing methods in mathematical reasoning tasks.

Authors:Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
Title: CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
Abstract:
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose \textbf{C}oordinating \textbf{E}ntropy via \textbf{G}radient-\textbf{P}reserving \textbf{P}olicy \textbf{O}ptimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
Chinese: 强化学习优化大语言模型处理复杂推理任务,而提出的CE-GPPO算法通过保留被裁剪标记的梯度来改善探索与利用的平衡,在数学推理基准测试中始终优于现有方法。
English: Reinforcement learning optimizes large language models for complex reasoning, and the proposed CE-GPPO algorithm enhances this by preserving gradients from clipped tokens to better balance exploration and exploitation, outperforming existing methods in mathematical reasoning tasks.

Authors:Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C. Karen Liu, Jiajun Wu
Title: VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation
Abstract:
Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: https://visualmimic.github.io .
中文摘要:VisualMimic是一个将自我中心视觉与分层全身控制相结合的视觉模拟到现实框架,通过从仿真到实体的零样本策略迁移,使人形机器人能够完成多种移动操作任务。
English Summary: VisualMimic is a visual sim-to-real framework that integrates egocentric vision with hierarchical whole-body control, enabling humanoid robots to perform diverse loco-manipulation tasks through zero-shot policy transfer from simulation to reality.

Authors:Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau
Title: Semantic Representation Attack against Aligned Large Language Models
Abstract:
Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, such as ``Sure, here is...'', suffering from limited convergence, unnatural prompts, and high computational costs. We introduce Semantic Representation Attack, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space comprising diverse responses with equivalent harmful meanings. This innovation resolves the inherent trade-off between attack efficacy and prompt naturalness that plagues existing methods. The Semantic Representation Heuristic Search algorithm is proposed to efficiently generate semantically coherent and concise adversarial prompts by maintaining interpretability during incremental expansion. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that our method achieves unprecedented attack success rates (89.41\% averaged across 18 LLMs, including 100\% on 11 models) while maintaining stealthiness and efficiency. Comprehensive experimental results confirm the overall superiority of our Semantic Representation Attack. The code will be publicly available.
Chinese: 语义表征攻击范式通过针对有害响应的语义空间而非精确文本模式来重新定义对抗目标,在保持提示自然性和效率的同时实现了前所未有的攻击成功率。
English: The Semantic Representation Attack paradigm redefines adversarial objectives by targeting the semantic space of harmful responses rather than exact text patterns, achieving unprecedented success rates while maintaining natural prompts and efficiency.

Authors:Zhenyu Tao, Wei Xu, Xiaohu You
Title: A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications
Abstract:
The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
中文: 广义互模拟度量(GBSM)将状态相似性度量严格扩展到成对马尔可夫决策过程,为策略迁移和状态聚合提供了更紧的理论界限,同时改进了样本复杂度的保证。
English: The generalized bisimulation metric (GBSM) rigorously extends state similarity measures to pairs of Markov decision processes, enabling tighter theoretical bounds for policy transfer and state aggregation while providing improved sample complexity guarantees.

Authors:Jon Crowcroft, Anil Madhavapeddy, Chris Hicks, Richard Mortier, Vasilios Mavroudis
Title: What if we could hot swap our Biometrics?
Abstract:
What if you could really revoke your actual biometric identity, and install a new one, by live rewriting your biological self? We propose some novel mechanisms for hot swapping identity based in novel biotechnology. We discuss the potential positive use cases, and negative consequences if such technology was to become available and affordable. Biometrics are selected on the basis that they are supposed to be unfakeable, or at least not at reasonable cost. If they become easier to fake, it may be much cheaper to fake someone else's biometrics than it is for you to change your own biometrics if someone does copy yours. This potentially makes biometrics a bad trade-off for the user. At the time of writing, this threat is highly speculative, but we believe it is worth raising and considering the potential consequences.
中文: 摘要探讨了利用生物技术热切换生物特征身份这一假设性概念,既指出了其潜在益处,也警示若生物特征易于更改,伪造他人特征的成本可能更低,从而削弱其对用户的安全优势。
English: The abstract explores the speculative concept of using biotechnology to hot-swap biometric identities, highlighting both potential benefits and the risk that making biometrics easily changeable could also make them cheaper to fake, thus undermining their security advantage for users.

Authors:Jun Ling, Yao Qi, Tao Huang, Shibo Zhou, Yanqin Huang, Jiang Yang, Ziqi Song, Ying Zhou, Yang Yang, Heng Tao Shen, Peng Wang
Title: Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models
Abstract:
In this work, we address the task of table image to LaTeX code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables -- those with large sizes, deeply nested structures, and semantically rich or irregular cell content -- where existing methods often fail. We begin with a comprehensive analysis, identifying key challenges and highlighting the limitations of current evaluation protocols. To overcome these issues, we propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset. To further improve generation quality, we introduce a dual-reward reinforcement learning strategy based on Group Relative Policy Optimization (GRPO). Unlike standard approaches that optimize purely over text outputs, our method incorporates both a structure-level reward on LaTeX code and a visual fidelity reward computed from rendered outputs, enabling direct optimization of the visual output quality. We adopt a hybrid evaluation protocol combining TEDS-Structure and CW-SSIM, and show that our method achieves state-of-the-art performance, particularly on structurally complex tables, demonstrating the effectiveness and robustness of our approach.
中文: 本研究提出了一种强化多模态大语言模型框架,通过双奖励强化学习策略从表格图像生成高质量LaTeX代码,在混合评估中尤其在复杂表格上实现了最优性能。
English: This study introduces a reinforced multimodal large language model framework with a dual-reward reinforcement learning strategy to generate high-quality LaTeX code from table images, achieving state-of-the-art performance especially on complex tables through hybrid evaluation.

Authors:Yao Wu, Ziye Jia, Qihui Wu, Yian Zhu
Title: A Lightweight Authentication and Key Agreement Protocol Design for FANET
Abstract:
The advancement of low-altitude intelligent networks enables unmanned aerial vehicle (UAV) interconnection via flying ad-hoc networks (FANETs), offering flexibility and decentralized coordination. However, resource constraints, dynamic topologies, and UAV operations in open environments present significant security and communication challenges. Existing multi-factor and public-key cryptography protocols are vulnerable due to their reliance on stored sensitive information, increasing the risk of exposure and compromise. This paper proposes a lightweight authentication and key agreement protocol for FANETs, integrating physical unclonable functions with dynamic credential management and lightweight cryptographic primitives. The protocol reduces computational and communication overhead while enhancing security. Security analysis confirms its resilience against various attacks, and comparative evaluations demonstrate its superiority in security, communication efficiency, and computational cost.
中文摘要:本文提出了一种用于FANET的轻量级认证与密钥协商协议,通过结合物理不可克隆功能和动态凭证管理,在降低计算与通信开销的同时显著提升了系统安全性。
English Summary: This paper introduces a lightweight authentication and key agreement protocol for FANETs that integrates physical unclonable functions and dynamic credential management to enhance security while reducing computational and communication overhead.

Authors:Zeyu Xie, Xuenan Xu, Yixuan Li, Mengyue Wu, Yuexian Zou
Title: STAR: Speech-to-Audio Generation via Representation Learning
Abstract:
This work presents STAR, the first end-to-end speech-to-audio generation framework, designed to enhance efficiency and address error propagation inherent in cascaded systems. Unlike prior approaches relying on text or vision, STAR leverages speech as it constitutes a natural modality for interaction. As an initial step to validate the feasibility of the system, we demonstrate through representation learning experiments that spoken sound event semantics can be effectively extracted from raw speech, capturing both auditory events and scene cues. Leveraging the semantic representations, STAR incorporates a bridge network for representation mapping and a two-stage training strategy to achieve end-to-end synthesis. With a 76.9% reduction in speech processing latency, STAR demonstrates superior generation performance over the cascaded systems. Overall, STAR establishes speech as a direct interaction signal for audio generation, thereby bridging representation learning and multimodal synthesis. Generated samples are available at https://zeyuxie29.github.io/STAR.
中文: STAR提出了首个端到端语音到音频生成框架,利用语音作为自然交互模态,通过从原始语音中提取声音事件语义,将处理延迟降低76.9%,性能优于级联系统。
English: STAR introduces the first end-to-end speech-to-audio generation framework, using speech as a natural interaction modality to reduce latency by 76.9% and outperform cascaded systems by capturing sound event semantics directly from raw speech.

Authors:Zeyu Xie, Yaoyun Zhang, Xuenan Xu, Yongkang Yin, Chenxing Li, Mengyue Wu, Yuexian Zou
Title: FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection
Abstract:
The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing where the sources originated, or generalizing to unseen sources-thereby limiting the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.
Chinese: FakeSound2基准测试通过评估模型在多种操作类型和来源上的定位、溯源和泛化能力,揭示了现有系统尽管准确率高,但在识别伪造模式和提供可靠解释方面仍存在不足,旨在推动可信音频认证的发展。
English: The FakeSound2 benchmark addresses limitations in deepfake sound detection by evaluating models on localization, traceability, and generalization across multiple manipulation types and sources, revealing current systems' struggles with forged patterns and reliability despite high accuracy.

Authors:Zeyu Xie, Yaoyun Zhang, Xuenan Xu, Yongkang Yin, Chenxing Li, Mengyue Wu, Yuexian Zou
Title: FakeSound2: A Benchmark for Explainable and Generalizable Deepfake Sound Detection
Abstract:
The rapid development of generative audio raises ethical and security concerns stemming from forged data, making deepfake sound detection an important safeguard against the malicious use of such technologies. Although prior studies have explored this task, existing methods largely focus on binary classification and fall short in explaining how manipulations occur, tracing where the sources originated, or generalizing to unseen sources-thereby limiting the explainability and reliability of detection. To address these limitations, we present FakeSound2, a benchmark designed to advance deepfake sound detection beyond binary accuracy. FakeSound2 evaluates models across three dimensions: localization, traceability, and generalization, covering 6 manipulation types and 12 diverse sources. Experimental results show that although current systems achieve high classification accuracy, they struggle to recognize forged pattern distributions and provide reliable explanations. By highlighting these gaps, FakeSound2 establishes a comprehensive benchmark that reveals key challenges and aims to foster robust, explainable, and generalizable approaches for trustworthy audio authentication.
Chinese: FakeSound2基准测试通过评估模型在多种操作类型和来源上的定位、溯源和泛化能力,揭示了现有系统尽管准确率高,但在识别伪造模式和提供可靠解释方面仍存在不足,旨在推动可信音频认证的发展。
English: The FakeSound2 benchmark addresses limitations in deepfake sound detection by evaluating models on localization, traceability, and generalization across multiple manipulation types and sources, revealing current systems' struggles with forged patterns and reliability despite high accuracy.

Authors:Can Cui, Ziye Jia, Jiahao You, Chao Dong, Qihui Wu, Han Zhu
Title: Robust and Secure Computation Offloading and Trajectory Optimization for Multi-UAV MEC Against Aerial Eavesdropper
Abstract:
The unmanned aerial vehicle (UAV) based multi-access edge computing (MEC) appears as a popular paradigm to reduce task processing latency. However, the secure offloading is an important issue when occurring aerial eavesdropping. Besides, the potential uncertainties in practical applications and flexible trajectory optimizations of UAVs pose formidable challenges for realizing robust offloading. In this paper, we consider the aerial secure MEC network including ground users, service unmanned aerial vehicles (S-UAVs) integrated with edge servers, and malicious UAVs overhearing transmission links. To deal with the task computation complexities, which are characterized as uncertainties, a robust problem is formulated with chance constraints. The energy cost is minimized by optimizing the connections, trajectories of S-UAVs and offloading ratios. Then, the proposed non-linear problem is tackled via the distributionally robust optimization and conditional value-at-risk mechanism, which is further transformed into the second order cone programming forms. Moreover, we decouple the reformulated problem and design the successive convex approximation for S-UAV trajectories. The global algorithm is designed to solve the sub-problems in a block coordinate decent manner. Finally, extensive simulations and numerical analyses are conducted to verify the robustness of the proposed algorithms, with just 2\% more energy cost compared with the ideal circumstance.
中文: 基于无人机的多接入边缘计算虽能降低任务延迟,但面临空中窃听和实际应用不确定性的安全挑战,为此采用鲁棒优化方法,通过优化服务无人机轨迹和任务卸载比例来最小化能耗并保障安全。
English: UAV-based multi-access edge computing (MEC) reduces task latency but faces security risks from aerial eavesdropping and uncertainties, prompting a robust optimization approach that minimizes energy costs through trajectory and offloading adjustments while ensuring security.

Authors:Yuhang Li, Yang Lu, Wei Chen, Bo Ai, Zhiguo Ding, Dusit Niyato
Title: BERT4beam: Large AI Model Enabled Generalized Beamforming Optimization
Abstract:
Artificial intelligence (AI) is anticipated to emerge as a pivotal enabler for the forthcoming sixth-generation (6G) wireless communication systems. However, current research efforts regarding large AI models for wireless communications primarily focus on fine-tuning pre-trained large language models (LLMs) for specific tasks. This paper investigates the large-scale AI model designed for beamforming optimization to adapt and generalize to diverse tasks defined by system utilities and scales. We propose a novel framework based on bidirectional encoder representations from transformers (BERT), termed BERT4beam. We aim to formulate the beamforming optimization problem as a token-level sequence learning task, perform tokenization of the channel state information, construct the BERT model, and conduct task-specific pre-training and fine-tuning strategies. Based on the framework, we propose two BERT-based approaches for single-task and multi-task beamforming optimization, respectively. Both approaches are generalizable for varying user scales. Moreover, the former can adapt to varying system utilities and antenna configurations by re-configuring the input and output module of the BERT model, while the latter, termed UBERT, can directly generalize to diverse tasks, due to a finer-grained tokenization strategy. Extensive simulation results demonstrate that the two proposed approaches can achieve near-optimal performance and outperform existing AI models across various beamforming optimization tasks, showcasing strong adaptability and generalizability.
中文: 本文提出了BERT4beam框架,利用双向编码器表示将波束成形优化转化为令牌级序列学习任务,在多种无线通信任务中实现了接近最优的性能和卓越的适应性。
English: This paper introduces BERT4beam, a novel framework using bidirectional encoder representations from transformers to optimize beamforming by treating it as a token-level sequence learning task, achieving near-optimal performance and superior adaptability across various wireless communication tasks.

Authors:Victor-Alexandru Pădurean, Tung Phung, Nachiket Kotalwar, Michael Liut, Juho Leinonen, Paul Denny, Adish Singla
Title: Humanizing Automated Programming Feedback: Fine-Tuning Generative Models with Student-Written Feedback
Abstract:
The growing need for automated and personalized feedback in programming education has led to recent interest in leveraging generative AI for feedback generation. However, current approaches tend to rely on prompt engineering techniques in which predefined prompts guide the AI to generate feedback. This can result in rigid and constrained responses that fail to accommodate the diverse needs of students and do not reflect the style of human-written feedback from tutors or peers. In this study, we explore learnersourcing as a means to fine-tune language models for generating feedback that is more similar to that written by humans, particularly peer students. Specifically, we asked students to act in the flipped role of a tutor and write feedback on programs containing bugs. We collected approximately 1,900 instances of student-written feedback on multiple programming problems and buggy programs. To establish a baseline for comparison, we analyzed a sample of 300 instances based on correctness, length, and how the bugs are described. Using this data, we fine-tuned open-access generative models, specifically Llama3 and Phi3. Our findings indicate that fine-tuning models on learnersourced data not only produces feedback that better matches the style of feedback written by students, but also improves accuracy compared to feedback generated through prompt engineering alone, even though some student-written feedback is incorrect. This surprising finding highlights the potential of student-centered fine-tuning to improve automated feedback systems in programming education.
中文: 本研究利用学习者贡献的学生反馈数据微调生成模型,相比传统提示工程方法,能生成更接近人类风格且准确性更高的编程自动反馈。
English: This study fine-tunes generative models with learnersourced student feedback to produce automated programming feedback that better mimics human style and improves accuracy compared to conventional prompt engineering methods.

Authors:Haolin Yuan, Jingtao Li, Weiming Zhuang, Chen Chen, Lingjuan Lyu
Title: FEDEXCHANGE: Bridging the Domain Gap in Federated Object Detection for Free
Abstract:
Federated Object Detection (FOD) enables clients to collaboratively train a global object detection model without accessing their local data from diverse domains. However, significant variations in environment, weather, and other domain specific factors hinder performance, making cross domain generalization a key challenge. Existing FOD methods often overlook the hardware constraints of edge devices and introduce local training regularizations that incur high computational costs, limiting real-world applicability. In this paper, we propose FEDEXCHANGE, a novel FOD framework that bridges domain gaps without introducing additional local computational overhead. FEDEXCHANGE employs a server side dynamic model exchange strategy that enables each client to gain insights from other clients' domain data without direct data sharing. Specifically, FEDEXCHANGE allows the server to alternate between model aggregation and model exchange. During aggregation rounds, the server aggregates all local models as usual. In exchange rounds, FEDEXCHANGE clusters and exchanges local models based on distance measures, allowing local models to learn from a variety of domains. As all operations are performed on the server side, clients can achieve improved cross domain utility without any additional computational overhead. Extensive evaluations demonstrate that FEDEXCHANGE enhances FOD performance, achieving 1.6X better mean average precision in challenging domains, such as rainy conditions, while requiring only 0.8X the computational resources compared to baseline methods.
中文: FEDEXCHANGE是一种新颖的联邦目标检测框架,通过服务器端模型交换和聚类克服领域差异,在不增加本地计算成本的情况下提升跨领域性能。
English: FEDEXCHANGE is a novel federated object detection framework that overcomes domain gaps through server-side model exchange and clustering, enhancing cross-domain performance without increasing local computational costs.

Authors:Yinan Deng, Yufeng Yue, Jianyu Dou, Jingyu Zhao, Jiahui Wang, Yujie Tang, Yi Yang, Mengyin Fu
Title: OmniMap: A General Mapping Framework Integrating Optics, Geometry, and Semantics
Abstract:
Robotic systems demand accurate and comprehensive 3D environment perception, requiring simultaneous capture of photo-realistic appearance (optical), precise layout shape (geometric), and open-vocabulary scene understanding (semantic). Existing methods typically achieve only partial fulfillment of these requirements while exhibiting optical blurring, geometric irregularities, and semantic ambiguities. To address these challenges, we propose OmniMap. Overall, OmniMap represents the first online mapping framework that simultaneously captures optical, geometric, and semantic scene attributes while maintaining real-time performance and model compactness. At the architectural level, OmniMap employs a tightly coupled 3DGS-Voxel hybrid representation that combines fine-grained modeling with structural stability. At the implementation level, OmniMap identifies key challenges across different modalities and introduces several innovations: adaptive camera modeling for motion blur and exposure compensation, hybrid incremental representation with normal constraints, and probabilistic fusion for robust instance-level understanding. Extensive experiments show OmniMap's superior performance in rendering fidelity, geometric accuracy, and zero-shot semantic segmentation compared to state-of-the-art methods across diverse scenes. The framework's versatility is further evidenced through a variety of downstream applications, including multi-domain scene Q&A, interactive editing, perception-guided manipulation, and map-assisted navigation.
Chinese: OmniMap是首个实时同步捕获光学、几何和语义场景属性的在线建图框架,采用3DGS-Voxel混合表示,在多种应用中实现了卓越的渲染保真度、几何精度和零样本语义分割性能。
English: OmniMap is the first real-time mapping framework that simultaneously captures optical, geometric, and semantic scene attributes using a hybrid 3DGS-Voxel representation, achieving superior rendering fidelity, geometric accuracy, and zero-shot semantic segmentation across diverse applications.

Authors:Walid El Maouaki, Nouhaila Innan, Alberto Marchisio, Taoufik Said, Muhammad Shafique, Mohamed Bennai
Title: RobQFL: Robust Quantum Federated Learning in Adversarial Environment
Abstract:
Quantum Federated Learning (QFL) merges privacy-preserving federation with quantum computing gains, yet its resilience to adversarial noise is unknown. We first show that QFL is as fragile as centralized quantum learning. We propose Robust Quantum Federated Learning (RobQFL), embedding adversarial training directly into the federated loop. RobQFL exposes tunable axes: client coverage $γ$ (0-100\%), perturbation scheduling (fixed-$\varepsilon$ vs $\varepsilon$-mixes), and optimization (fine-tune vs scratch), and distils the resulting $γ\times \varepsilon$ surface into two metrics: Accuracy-Robustness Area and Robustness Volume. On 15-client simulations with MNIST and Fashion-MNIST, IID and Non-IID conditions, training only 20-50\% clients adversarially boosts $\varepsilon \leq 0.1$ accuracy $\sim$15 pp at $< 2$ pp clean-accuracy cost; fine-tuning adds 3-5 pp. With $\geq$75\% coverage, a moderate $\varepsilon$-mix is optimal, while high-$\varepsilon$ schedules help only at 100\% coverage. Label-sorted non-IID splits halve robustness, underscoring data heterogeneity as a dominant risk.
中文: 量子联邦学习被发现与集中式量子学习同样脆弱,但提出的鲁棒量子联邦学习(RobQFL)通过在联邦过程中嵌入对抗训练,显著提升了对抗鲁棒性,在不同条件下以微小干净精度损失实现了明显的精度提升。
English: Quantum Federated Learning is shown to be as vulnerable as centralized quantum learning, but the proposed Robust Quantum Federated Learning (RobQFL) enhances adversarial robustness by integrating adversarial training into the federated process, achieving significant accuracy improvements with minimal clean-accuracy loss under various conditions.

Authors:Rushi Wang, Jiateng Liu, Cheng Qian, Yifan Shen, Yanzhou Pan, Zhaozhuo Xu, Ahmed Abbasi, Heng Ji, Denghui Zhang
Title: Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts
Abstract:
Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
中文摘要:本研究揭示大型语言模型在混合语境中会过度关注非主流信息,导致易受不当内容影响,并提出RW-Steering微调方法,通过训练模型内部识别有害信号,将回答质量提升39.8%,实现可靠的内容过滤。
English Summary: This study reveals that large language models disproportionately prioritize less prevalent information in mixed contexts, making them vulnerable to inappropriate content, and introduces RW-Steering, a fine-tuning method that improves response quality by 39.8% by teaching models to internally filter harmful signals.

Authors:Han Chen, Hanchen Wang, Hongmei Chen, Ying Zhang, Lu Qin, Wenjie Zhang
Title: HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis
Abstract:
The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hierarchical structure of software. Existing methods often oversimplify programs into single level graphs, failing to model the crucial semantic relationship between high-level functional interactions and low-level instruction logic. To bridge this gap, we introduce \dataset, the largest public hierarchical graph dataset for malware analysis, comprising over \textbf{200M} Control Flow Graphs (CFGs) nested within \textbf{595K} Function Call Graphs (FCGs). This two-level representation preserves structural semantics essential for building robust detectors resilient to code obfuscation and malware evolution. We demonstrate HiGraph's utility through a large-scale analysis that reveals distinct structural properties of benign and malicious software, establishing it as a foundational benchmark for the community. The dataset and tools are publicly available at https://higraph.org.
中文: HiGraph数据集通过提供包含59.5万个函数调用图中嵌套的2亿多个控制流图,解决了恶意软件分析中缺乏大规模分层数据的问题,为区分良性软件和恶意软件的结构差异提供了有效基准。
English: The HiGraph dataset addresses the lack of large-scale hierarchical data in malware analysis by providing over 200 million nested control flow graphs within 595,000 function call graphs, enabling robust detection of structural differences between benign and malicious software.

Authors:Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, Cihang Xie
Title: OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Abstract:
This paper provides a simplification on OpenVision's architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
中文: 本文介绍了OpenVision 2,通过移除文本编码器和对比损失简化架构,在保持竞争力的性能同时显著减少了训练时间和内存消耗,并成功扩展到超过10亿参数规模。
English: This paper introduces OpenVision 2, a simplified version that removes the text encoder and contrastive loss to enhance training efficiency, achieving competitive performance with significant reductions in training time and memory usage while scaling up to over 1 billion parameters.

Authors:Jianyu Dou, Yinan Deng, Jiahui Wang, Xingsi Tang, Yi Yang, Yufeng Yue
Title: OpenMulti: Open-Vocabulary Instance-Level Multi-Agent Distributed Implicit Mapping
Abstract:
Multi-agent distributed collaborative mapping provides comprehensive and efficient representations for robots. However, existing approaches lack instance-level awareness and semantic understanding of environments, limiting their effectiveness for downstream applications. To address this issue, we propose OpenMulti, an open-vocabulary instance-level multi-agent distributed implicit mapping framework. Specifically, we introduce a Cross-Agent Instance Alignment module, which constructs an Instance Collaborative Graph to ensure consistent instance understanding across agents. To alleviate the degradation of mapping accuracy due to the blind-zone optimization trap, we leverage Cross Rendering Supervision to enhance distributed learning of the scene. Experimental results show that OpenMulti outperforms related algorithms in both fine-grained geometric accuracy and zero-shot semantic accuracy. In addition, OpenMulti supports instance-level retrieval tasks, delivering semantic annotations for downstream applications. The project website of OpenMulti is publicly available at https://openmulti666.github.io/.
中文摘要:OpenMulti是一种开放词汇的多智能体分布式建图框架,通过跨智能体实例对齐和交叉渲染监督提升实例级语义理解能力,在几何精度和零样本语义识别方面表现优异,并能支持下游应用的实例检索任务。
English Summary: OpenMulti is an open-vocabulary multi-agent distributed mapping framework that enhances instance-level semantic understanding through cross-agent instance alignment and cross-rendering supervision, achieving superior geometric and semantic accuracy while supporting downstream applications.

Authors:Kangxiang Xia, Xinfa Zhu, Jixun Yao, Lei Xie
Title: MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech
Abstract:
In recent years, text-to-speech (TTS) has seen impressive advancements through large-scale language models, achieving human-level speech quality. Integrating human feedback has proven effective for enhancing robustness in these systems. However, current approaches face challenges in optimizing TTS with preference data across multiple dimensions and often suffer from performance degradation due to overconfidence in rewards. We propose Multidimensional Preference Optimization (MPO) to better align TTS systems with human preferences. MPO introduces a preference set that streamlines the construction of data for multidimensional preference optimization, enabling alignment with multiple dimensions. Additionally, we incorporate regularization during training to address the typical degradation issues in DPO-based approaches. Our experiments demonstrate MPO's effectiveness, showing significant improvements in intelligibility, speaker similarity, and prosody compared to baseline systems.
中文:针对当前文本转语音系统在多维偏好优化中的挑战和过度自信问题,提出了多维偏好优化方法,通过引入偏好集和正则化技术,显著提升了清晰度、说话人相似性和韵律表现。
English: Recent TTS advancements face challenges in optimizing with multidimensional preference data and overconfidence issues, leading to the proposal of Multidimensional Preference Optimization (MPO) that introduces a preference set and regularization to significantly enhance intelligibility, speaker similarity, and prosody.

Authors:Zihao Zheng, Zeyu Xie, Xuenan Xu, Wen Wu, Chao Zhang, Mengyue Wu
Title: PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description
Abstract:
Controllable text-to-audio generation (TTA) has attracted much attention recently. Although existing works can achieve fine-grained controllability based on timestamp information, sound event categories are limited to a fixed set. Moreover, since only simulated data is used for training, the generated audio quality and generalization performance on real data are limited. To tackle this issue, we propose PicoAudio2, improving temporal-controllable TTA via a new data processing pipeline and model architecture. Specifically, we use a grounding model to annotate event timestamps of real audio-text datasets to curate temporally-strong real data, in addition to simulation data from existing works. The model is trained on the combination of real and simulation data. Moreover, following PicoAudio, we encode timestamp information into a timestamp matrix to provide extra fine-grained time-aligned information to the model, on top of the coarse-grained textual description. Experiments show that PicoAudio2 exhibits superior performance in terms of temporal controllability and audio quality.
中文: PicoAudio2通过整合真实与模拟数据并引入时间戳矩阵,提升了可控文本到音频生成的时间精确性和音质表现。
English: PicoAudio2 enhances controllable text-to-audio generation by combining real and simulated data with a timestamp matrix, improving both temporal accuracy and audio quality.

Authors:Ping Xu, Zaitian Wang, Zhirui Wang, Pengjiang Li, Ran Zhang, Gaoyang Li, Hanyu Xie, Jiajia Wang, Yuanchun Zhou, Pengfei Wang
Title: scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis
Abstract:
Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.
中文:scUnified资源整合了13个跨物种和组织的单细胞RNA测序高质量数据集,通过标准化处理为计算方法提供了无需额外清洗的统一评估基础。
English: The scUnified resource standardizes 13 high-quality single-cell RNA sequencing datasets across species and tissues, enabling reproducible computational analysis without additional preprocessing.

Authors:Mingyu Chen, Jingkai Lin, Zhaojie Chu, Xiaofen Xing, Yirong Chen, Xiangmin Xu
Title: CATCH: A Novel Data Synthesis Framework for High Therapy Fidelity and Memory-Driven Planning Chain of Thought in AI Counseling
Abstract:
Recently, advancements in AI counseling based on large language models have shown significant progress. However, existing studies employ a one-time generation approach to synthesize multi-turn dialogue samples, resulting in low therapy fidelity and failing to capture the decision-making rationale behind each response. In this work, we propose CATCH, a novel data synthesis framework designed to address these challenges. Specifically, to improve therapy fidelity, we introduce the Progressive Dialogue Synthesis strategy, which extracts goals, resources, and solutions from a client's self-report, organizes them into structured outlines, and then incrementally generates stage-aligned counseling dialogues. To capture decision-making rationale behind each response, we propose the Memory-Driven Dynamic Planning thinking pattern that integrates memory enhancement, global planning, and strategy reasoning; a collaborative multi-agent optimizer then leverages MDP to attach explicit chain-of-thought to each dialogue turn. Extensive experiments and human evaluations demonstrate that CATCH significantly enhances fidelity and logical coherence in AI counseling.
Chinese: CATCH框架采用渐进式对话合成与记忆驱动动态规划,通过结构化对话生成与思维链标注,显著提升了AI心理咨询的忠实度与逻辑连贯性。
English: The CATCH framework introduces Progressive Dialogue Synthesis and Memory-Driven Dynamic Planning to enhance therapy fidelity and attach explicit decision-making rationale in AI counseling dialogues, significantly improving their quality and coherence.

Authors:Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Abrham Belete Haile, Grigori Sidorov, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, Vukosi Marivate, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad
Title: The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005-2025)
Abstract:
Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and the researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) basic research questions such as: i) How has the nature of NLP evolved over the last two decades?, ii) What are the contributions of AfricaNLP papers?, and iii) Which individuals and organizations (authors, affiliated institutions, and funding bodies) have been involved in the development of AfricaNLP? We quantitatively examine the contributions of AfricaNLP research using 1.9K NLP paper abstracts, 4.9K author contributors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions) along with benchmark results. Our dataset and continuously existing NLP progress tracking website provide a powerful lens for tracing AfricaNLP research trends and hold potential for generating data-driven literature surveys.
中文: 本研究通过量化分析数千篇论文和作者的数据,探讨了非洲自然语言处理研究二十年的演变历程与贡献,为该领域发展提供了重要见解和持续追踪工具。
English: This study analyzes the evolution and contributions of African Natural Language Processing (AfricaNLP) research over two decades using quantitative data from thousands of papers and authors, providing valuable insights and tracking tools for the field's development.

Authors:Penghai Zhao, Jinyu Tian, Qinghua Xing, Xin Zhang, Zheng Li, Jianjun Qian, Ming-Ming Cheng, Xiang Li
Title: NAIPv2: Debiased Pairwise Learning for Efficient Paper Quality Estimation
Abstract:
The ability to estimate the quality of scientific papers is central to how both humans and AI systems will advance scientific knowledge in the future. However, existing LLM-based estimation methods suffer from high inference cost, whereas the faster direct score regression approach is limited by scale inconsistencies. We present NAIPv2, a debiased and efficient framework for paper quality estimation. NAIPv2 employs pairwise learning within domain-year groups to reduce inconsistencies in reviewer ratings and introduces the Review Tendency Signal (RTS) as a probabilistic integration of reviewer scores and confidences. To support training and evaluation, we further construct NAIDv2, a large-scale dataset of 24,276 ICLR submissions enriched with metadata and detailed structured content. Trained on pairwise comparisons but enabling efficient pointwise prediction at deployment, NAIPv2 achieves state-of-the-art performance (78.2% AUC, 0.432 Spearman), while maintaining scalable, linear-time efficiency at inference. Notably, on unseen NeurIPS submissions, it further demonstrates strong generalization, with predicted scores increasing consistently across decision categories from Rejected to Oral. These findings establish NAIPv2 as a debiased and scalable framework for automated paper quality estimation, marking a step toward future scientific intelligence systems. Code and dataset are released at sway.cloud.microsoft/Pr42npP80MfPhvj8.
中文摘要:NAIPv2是一个用于科学论文质量评估的无偏高效框架,通过成对学习和概率性评分整合实现了最先进的性能,同时保持了线性时间推理效率。
English Summary: NAIPv2 is a debiased and efficient framework for scientific paper quality estimation that achieves state-of-the-art performance through pairwise learning and probabilistic score integration while maintaining linear-time inference efficiency.

Authors:Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, Tomas Pfister
Title: ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Abstract:
With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent's interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors naturally arise.
中文摘要:ReasoningBank是一种记忆框架,使语言模型智能体能够从过去的成功与失败中学习并泛化推理策略,而内存感知测试时扩展通过生成多样化经验加速这一学习过程,实现持续自我进化。
English Summary: ReasoningBank is a memory framework that enables language model agents to learn and generalize reasoning strategies from past successes and failures, while memory-aware test-time scaling accelerates this learning by generating diverse experiences for continuous self-improvement.

Authors:Masoud Kaveh, Farshad Rostami Ghadi, Francisco Hernando-Gallego, Diego Martín, Kai-Kit Wong, Riku Jäntti
Title: Physical Layer Security over Fluid Reconfigurable Intelligent Surface-assisted Communication Systems
Abstract:
This letter investigates the secrecy performance of wireless communication systems assisted by a fluid reconfigurable intelligent surface (FRIS). Unlike conventional reconfigurable intelligent surfaces (RISs) with fixed geometries, FRISs dynamically select a subset of reflective elements based on real-time channel conditions, offering enhanced spatial diversity and adaptability. Using this foundation, we model a secure downlink scenario where a base station communicates with a legitimate user in the presence of an eavesdropper, and the propagation is assisted by a FRIS with a limited number of elements set to the ON state. We analyze the system's secrecy performance under spatial correlation by deriving analytical lower and upper bounds for the secrecy outage probability (SOP) and average secrecy capacity (ASC), respectively. Our results demonstrate that FRIS effectively enables secure communication under spatial correlation. Even with partial activation, FRIS significantly outperforms conventional RISs in enhancing secrecy performance under varying deployment densities and element correlations.
中文摘要:本研究证明,流体可重构智能表面(FRIS)通过动态选择反射单元,在空间相关性条件下即使部分激活也能显著提升无线通信安全性能,优于传统可重构智能表面。
English Summary: This study demonstrates that fluid reconfigurable intelligent surfaces (FRIS) significantly enhance wireless communication security by dynamically selecting reflective elements, outperforming conventional RIS in secrecy performance even with partial activation under spatial correlation.

Authors:Matteo Zecchin, Unnikrishnan Kunnath Ganesan, Giuseppe Durisi, Petar Popovski, Osvaldo Simeone
Title: Prediction-Powered Communication with Distortion Guarantees
Abstract:
The development of 6G wireless systems is taking place alongside the development of increasingly intelligent wireless devices and network nodes. The changing technological landscape is motivating a rethinking of classical Shannon information theory that emphasizes semantic and task-oriented paradigms. In this paper, we study a prediction-powered communication setting, in which devices, equipped with artificial intelligence (AI)-based predictors, communicate under zero-delay constraints with strict distortion guarantees. Two classes of distortion measures are considered: (i) outage-based metrics, suitable for tasks tolerating occasional packet losses, such as real-time control or monitoring; and (ii) bounded distortion metrics, relevant to semantic-rich tasks like text or video transmission. We propose two zero-delay compression algorithms leveraging online conformal prediction to provide per-sequence guarantees on the distortion of reconstructed sequences over error-free and packet-erasure channels with feedback. For erasure channels, we introduce a doubly-adaptive conformal update to compensate for channel-induced errors and derive sufficient conditions on erasure statistics to ensure distortion constraints. Experiments on semantic text compression validate the approach, showing significant bit rate reductions while strictly meeting distortion guarantees compared to state-of-the-art prediction-powered compression methods.
中文摘要:本文提出基于在线保形预测的零延迟压缩算法,为6G网络中配备AI的设备提供严格失真保证,在语义文本压缩中显著降低比特率。
English Summary: The paper introduces zero-delay compression algorithms using online conformal prediction to ensure strict distortion guarantees for AI-equipped devices in 6G networks, achieving significant bit rate reductions in semantic text compression.

Authors:Yingdong Hu, Yisheng He, Jinnan Chen, Weihao Yuan, Kejie Qiu, Zehong Lin, Siyu Zhu, Zilong Dong, Jun Zhang
Title: Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos
Abstract:
Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by the slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel view and novel time synthesis. Our model simplifies the 4D reconstruction and interpolation problem as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For the task of streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel time synthesis, we design a novel motion prediction module to predict dense motions for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of the ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to ensure motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: https://zhenliuzju.github.io/huyingdong/Forge4D.
中文: Forge4D是一种前馈模型,通过流式3D高斯重建和自监督运动预测,从稀疏视角视频高效重构动态4D人体,实现新视角和新时间的合成。
English: Forge4D is a feed-forward model that efficiently reconstructs dynamic 4D humans from sparse-view videos, enabling novel view and time synthesis through streaming 3D Gaussian reconstruction and self-supervised motion prediction.

Authors:Linwei Tao, Yi-Fan Yeh, Bo Kai, Minjing Dong, Tao Huang, Tom A. Lamb, Jialin Yu, Philip H. S. Torr, Chang Xu
Title: Can Large Language Models Express Uncertainty Like Human?
Abstract:
Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.
在高风险应用中,语言置信度通过使用模糊限制语为传统不确定性估计提供了轻量且以人为本的替代方案,新数据集和方法显著提升了其在各类大语言模型中的可靠性。
In high-stakes applications, linguistic confidence offers a lightweight and human-aligned alternative to traditional uncertainty estimation by using hedging expressions, with new datasets and methods enhancing its reliability across LLMs.

Authors:Shulian Zhang, Yong Guo, Long Peng, Ziyang Wang, Ye Chen, Wenbo Li, Xiao Zhang, Yulun Zhang, Jian Chen
Title: VividFace: High-Quality and Efficient One-Step Diffusion For Video Face Enhancement
Abstract:
Video Face Enhancement (VFE) seeks to reconstruct high-quality facial regions from degraded video sequences, a capability that underpins numerous applications including video conferencing, film restoration, and surveillance. Despite substantial progress in the field, current methods that primarily rely on video super-resolution and generative frameworks continue to face three fundamental challenges: (1) faithfully modeling intricate facial textures while preserving temporal consistency; (2) restricted model generalization due to the lack of high-quality face video training data; and (3) low efficiency caused by repeated denoising steps during inference. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for video face enhancement. Built upon the pretrained WANX video generation model, our method leverages powerful spatiotemporal priors through a single-step flow matching paradigm, enabling direct mapping from degraded inputs to high-quality outputs with significantly reduced inference time. To further boost efficiency, we propose a Joint Latent-Pixel Face-Focused Training strategy that employs stochastic switching between facial region optimization and global reconstruction, providing explicit supervision in both latent and pixel spaces through a progressive two-stage training process. Additionally, we introduce an MLLM-driven data curation pipeline for automated selection of high-quality video face datasets, enhancing model generalization. Extensive experiments demonstrate that VividFace achieves state-of-the-art results in perceptual quality, identity preservation, and temporal stability, while offering practical resources for the research community.
Chinese: VividFace是一种新颖的一步扩散框架,通过利用时空先验和联合训练策略,有效提升视频面部质量,克服了纹理建模、泛化能力和推理效率等关键挑战,并取得了最先进的性能。
English: VividFace is a novel one-step diffusion framework that efficiently enhances video facial quality by leveraging spatiotemporal priors and a joint training strategy, overcoming key challenges in texture modeling, generalization, and inference speed while achieving state-of-the-art results.

Authors:Yapeng Mi, Hengli Li, Yanpeng Zhao, Chenxi Li, Huimin Wu, Xiaojian Ma, Song-Chun Zhu, Ying Nian Wu, Qing Li
Title: MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning
Abstract:
Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR's non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.
Chinese: MILR是一种新颖的测试时方法,通过在统一潜在空间中对图像和文本进行联合推理来增强图像生成,无需微调即可在多个基准测试中取得最先进的结果。
English: MILR is a novel test-time method that enhances image generation by jointly reasoning over image and text in a unified latent space, achieving state-of-the-art results across multiple benchmarks without requiring fine-tuning.

Authors:Xinyu Liu, Bei Li, Jiahao Liu, Junhao Ruan, Kechen Jiao, Hongyin Tang, Jingang Wang, Xiao Tong, Jingbo Zhu
Title: IIET: Efficient Numerical Transformer via Implicit Iterative Euler Method
Abstract:
High-order numerical methods enhance Transformer performance in tasks like NLP and CV, but introduce a performance-efficiency trade-off due to increased computational overhead. Our analysis reveals that conventional efficiency techniques, such as distillation, can be detrimental to the performance of these models, exemplified by PCformer. To explore more optimizable ODE-based Transformer architectures, we propose the \textbf{I}terative \textbf{I}mplicit \textbf{E}uler \textbf{T}ransformer \textbf{(IIET)}, which simplifies high-order methods using an iterative implicit Euler approach. This simplification not only leads to superior performance but also facilitates model compression compared to PCformer. To enhance inference efficiency, we introduce \textbf{I}teration \textbf{I}nfluence-\textbf{A}ware \textbf{D}istillation \textbf{(IIAD)}. Through a flexible threshold, IIAD allows users to effectively balance the performance-efficiency trade-off. On lm-evaluation-harness, IIET boosts average accuracy by 2.65\% over vanilla Transformers and 0.8\% over PCformer. Its efficient variant, E-IIET, significantly cuts inference overhead by 55\% while retaining 99.4\% of the original task accuracy. Moreover, the most efficient IIET variant achieves an average performance gain exceeding 1.6\% over vanilla Transformer with comparable speed.
中文: 提出的迭代隐式欧拉变换器(IIET)通过简化高阶方法提升了Transformer性能,并结合迭代影响感知蒸馏(IIAD),在显著降低推理开销的同时保持高任务精度,实现了更优的准确性与效率平衡。
English: The proposed Iterative Implicit Euler Transformer (IIET) enhances Transformer performance by simplifying high-order methods and, combined with Iteration Influence-Aware Distillation (IIAD), achieves superior accuracy and efficiency, significantly reducing inference overhead while maintaining high task accuracy.

Authors:Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou
Title: Stochastic activations
Abstract:
We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
中文摘要:我们提出随机激活方法,通过在训练中随机选择SILU或RELU函数来规避RELU的梯度流问题,不仅提升了模型性能与推理效率,还为文本生成提供了可控的多样性增强方案。
English Summary: We propose stochastic activations, a method that randomly alternates between SILU and RELU functions during training to overcome RELU's gradient flow limitations, achieving improved performance and inference efficiency while offering enhanced text generation diversity.

Authors:Bowen Wang, Matteo Zecchin, Osvaldo Simeone
Title: Distributed Associative Memory via Online Convex Optimization
Abstract:
An associative memory (AM) enables cue-response recall, and associative memorization has recently been noted to underlie the operation of modern neural architectures such as Transformers. This work addresses a distributed setting where agents maintain a local AM to recall their own associations as well as selective information from others. Specifically, we introduce a distributed online gradient descent method that optimizes local AMs at different agents through communication over routing trees. Our theoretical analysis establishes sublinear regret guarantees, and experiments demonstrate that the proposed protocol consistently outperforms existing online optimization baselines.
中文摘要:本文提出了一种分布式在线梯度下降方法,通过路由树上的通信优化各智能体的局部联想记忆,实现了次线性遗憾并优于现有在线优化基准方法。
English Summary: This paper introduces a distributed online gradient descent method for optimizing local associative memories across agents via communication over routing trees, achieving sublinear regret and outperforming existing online optimization baselines.

Authors:Runmin Wu, Mengyang Feng, Wenlong Guan, Dong Wang, Huchuan Lu, Errui Ding
Title: A Mutual Learning Method for Salient Object Detection with intertwined Multi-Supervision--Revised
Abstract:
Though deep learning techniques have made great progress in salient object detection recently, the predicted saliency maps still suffer from incomplete predictions due to the internal complexity of objects and inaccurate boundaries caused by strides in convolution and pooling operations. To alleviate these issues, we propose to train saliency detection networks by exploiting the supervision from not only salient object detection, but also foreground contour detection and edge detection. First, we leverage salient object detection and foreground contour detection tasks in an intertwined manner to generate saliency maps with uniform highlight. Second, the foreground contour and edge detection tasks guide each other simultaneously, thereby leading to precise foreground contour prediction and reducing the local noises for edge prediction. In addition, we develop a novel mutual learning module (MLM) which serves as the building block of our method. Each MLM consists of multiple network branches trained in a mutual learning manner, which improves the performance by a large margin. Extensive experiments on seven challenging datasets demonstrate that the proposed method has delivered state-of-the-art results in both salient object detection and edge detection.
中文摘要:该方法通过结合显著目标检测、前景轮廓检测和边缘检测的监督信息,并利用新型互学习模块进行训练,在多个数据集上实现了最先进的检测效果。
English Summary: The proposed method enhances salient object detection by integrating supervision from object, contour, and edge detection tasks through a novel mutual learning module, achieving state-of-the-art results across multiple datasets.

Authors:Junfeng Yan, Biao Wu, Meng Fang, Ling Chen
Title: Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
Abstract:
Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
中文摘要:本研究针对车载图形用户界面的特殊挑战,推出了首个高保真基准平台Automotive-ENV,并提出集成地理位置感知的ASURADA智能体,通过动态适应驾驶环境显著提升了安全任务的执行效果。
English Summary: The study introduces Automotive-ENV, a specialized benchmark for vehicle GUI interactions addressing driver attention and safety challenges, and proposes ASURADA, a geo-aware agent that improves task performance by integrating location-based context.

Authors:Junfeng Yan, Biao Wu, Meng Fang, Ling Chen
Title: Automotive-ENV: Benchmarking Multimodal Agents in Vehicle Interface Systems
Abstract:
Multimodal agents have demonstrated strong performance in general GUI interactions, but their application in automotive systems has been largely unexplored. In-vehicle GUIs present distinct challenges: drivers' limited attention, strict safety requirements, and complex location-based interaction patterns. To address these challenges, we introduce Automotive-ENV, the first high-fidelity benchmark and interaction environment tailored for vehicle GUIs. This platform defines 185 parameterized tasks spanning explicit control, implicit intent understanding, and safety-aware tasks, and provides structured multimodal observations with precise programmatic checks for reproducible evaluation. Building on this benchmark, we propose ASURADA, a geo-aware multimodal agent that integrates GPS-informed context to dynamically adjust actions based on location, environmental conditions, and regional driving norms. Experiments show that geo-aware information significantly improves success on safety-aware tasks, highlighting the importance of location-based context in automotive environments. We will release Automotive-ENV, complete with all tasks and benchmarking tools, to further the development of safe and adaptive in-vehicle agents.
中文摘要:本研究针对车载图形用户界面的特殊挑战,推出了首个高保真基准平台Automotive-ENV,并提出集成地理位置感知的ASURADA智能体,通过动态适应驾驶环境显著提升了安全任务的执行效果。
English Summary: The study introduces Automotive-ENV, a specialized benchmark for vehicle GUI interactions addressing driver attention and safety challenges, and proposes ASURADA, a geo-aware agent that improves task performance by integrating location-based context.

Authors:Sihang Nie, Xiaofen Xing, Jingyuan Xing, Baiji Liu, Xiangmin Xu
Title: HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens for Instruction-based TTS
Abstract:
Large Language Model (LLM)-based Text-to-Speech (TTS) models have already reached a high degree of naturalness. However, the precision control of TTS inference is still challenging. Although instruction-based Text-to-Speech (Instruct-TTS) models are proposed, these models still lack fine-grained control due to the modality gap between single-level text instructions and multilevel speech tokens. To address this limitation, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec to extract distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality gap of these tokens, we propose a hierarchical decoding strategy, where the LLM generates tokens in a structured order: first semantic, then fine-grained style, and finally complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis. Audio samples are available at https://xxh333.github.io/.
中文:提出的HD-PPT框架通过分层语音标记和结构化解码来弥合模态差距,在实现顶级自然度的同时,实现了对语音合成的精准控制。
English: The proposed HD-PPT framework introduces hierarchical speech tokens and structured decoding to bridge the modality gap, enabling precise control over speech synthesis while achieving state-of-the-art naturalness.

Authors:Shaoheng Wang, Yao Lu, Yuqi Li, Yaxin Gao, Jiaqi Nie, Shanqing Yu, Yingli Tian, Qi Xuan
Title: LoRALib: A Standardized Benchmark for Evaluating LoRA-MoE Methods
Abstract:
As a parameter efficient fine-tuning (PEFT) method, low-rank adaptation (LoRA) can save significant costs in storage and computing, but its strong adaptability to a single task is often accompanied by insufficient cross-task generalization capabilities. To improve this, existing work combines LoRA with mixture-of-experts (MoE) to enhance the model's adaptability through expert modules and routing mechanisms. However, existing LoRA-MoE methods lack unified standards in models, datasets, hyperparameters, and evaluation methods, making it difficult to conduct fair comparisons between different methods. To this end, we proposed a unified benchmark named LoRALib. Specifically, we standardized datasets from $40$ downstream tasks into a unified format, fine-tuned them using the same hyperparameters and obtained $680$ LoRA modules across $17$ model architectures. Based on this LoRA library, we conduct large-scale experiments on $3$ representative LoRA-MoE methods and different LoRA selection mechanisms using the open-sourced testing tool OpenCompass. Extensive experiments show that LoRAMoE performs best, and that prioritizing LoRAs relevant to the target task can further improve the performance of MoE. We hope these findings will inspire future work. Our datasets and LoRA library are available at https://huggingface.co/datasets/YaoLuzjut/LoRAOcean_dataset and https://huggingface.co/YaoLuzjut/models.
Chinese: 针对LoRA在单任务适应性强但跨任务泛化能力不足的问题,我们建立了LoRALib统一基准,通过标准化数据集和超参数,发现LoRAMoE表现最佳且优先选择任务相关LoRA能进一步提升混合专家模型性能。
English: LoRA's strong single-task performance lacks cross-task generalization, so we created LoRALib, a unified benchmark that standardizes datasets and hyperparameters, finding LoRAMoE excels and task-relevant LoRA selection boosts MoE performance.

Authors:Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
Title: UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Abstract:
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
中文摘要:UniPixel模型通过整合视觉提示与掩码生成,实现了像素级细粒度推理,在包括新型PixelQA任务在内的多个基准测试中验证了其有效性。
English Summary: The UniPixel model bridges the gap in pixel-level understanding by integrating visual prompts with mask generation for fine-grained reasoning, validated across multiple benchmarks including a novel PixelQA task.

Authors:Yunchu Han, Zhaojun Nan, Sheng Zhou, Zhisheng Niu
Title: Joint Memory Frequency and Computing Frequency Scaling for Energy-efficient DNN Inference
Abstract:
Deep neural networks (DNNs) have been widely applied in diverse applications, but the problems of high latency and energy overhead are inevitable on resource-constrained devices. To address this challenge, most researchers focus on the dynamic voltage and frequency scaling (DVFS) technique to balance the latency and energy consumption by changing the computing frequency of processors. However, the adjustment of memory frequency is usually ignored and not fully utilized to achieve efficient DNN inference, which also plays a significant role in the inference time and energy consumption. In this paper, we first investigate the impact of joint memory frequency and computing frequency scaling on the inference time and energy consumption with a model-based and data-driven method. Then by combining with the fitting parameters of different DNN models, we give a preliminary analysis for the proposed model to see the effects of adjusting memory frequency and computing frequency simultaneously. Finally, simulation results in local inference and cooperative inference cases further validate the effectiveness of jointly scaling the memory frequency and computing frequency to reduce the energy consumption of devices.
Chinese: 本文提出了一种基于模型和数据驱动的方法来联合调整内存与计算频率,通过仿真验证了该方法在本地和协同DNN推理场景中均能有效降低设备能耗。
English: This paper proposes a model-based and data-driven approach to jointly scale memory and computing frequencies, demonstrating through simulations that this method effectively reduces energy consumption in both local and cooperative DNN inference scenarios.

Authors:Yunchu Han, Zhaojun Nan, Sheng Zhou, Zhisheng Niu
Title: Joint Memory Frequency and Computing Frequency Scaling for Energy-efficient DNN Inference
Abstract:
Deep neural networks (DNNs) have been widely applied in diverse applications, but the problems of high latency and energy overhead are inevitable on resource-constrained devices. To address this challenge, most researchers focus on the dynamic voltage and frequency scaling (DVFS) technique to balance the latency and energy consumption by changing the computing frequency of processors. However, the adjustment of memory frequency is usually ignored and not fully utilized to achieve efficient DNN inference, which also plays a significant role in the inference time and energy consumption. In this paper, we first investigate the impact of joint memory frequency and computing frequency scaling on the inference time and energy consumption with a model-based and data-driven method. Then by combining with the fitting parameters of different DNN models, we give a preliminary analysis for the proposed model to see the effects of adjusting memory frequency and computing frequency simultaneously. Finally, simulation results in local inference and cooperative inference cases further validate the effectiveness of jointly scaling the memory frequency and computing frequency to reduce the energy consumption of devices.
Chinese: 本文提出了一种基于模型和数据驱动的方法来联合调整内存与计算频率,通过仿真验证了该方法在本地和协同DNN推理场景中均能有效降低设备能耗。
English: This paper proposes a model-based and data-driven approach to jointly scale memory and computing frequencies, demonstrating through simulations that this method effectively reduces energy consumption in both local and cooperative DNN inference scenarios.

Authors:Fei Zhao, Chengqiang Lu, Yufan Shen, Qimeng Wang, Yicheng Qian, Haoxin Zhang, Yan Gao, Yi Wu, Yao Hu, Zhen Wu, Shangyu Xing, Xinyu Dai
Title: RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios
Abstract:
While various multimodal multi-image evaluation datasets have been emerged, but these datasets are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Ultimately, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8\% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context.
中文:RealBench是首个中文多模态多图像数据集,包含9393个样本和69910张图像,采用真实用户生成内容,覆盖多样化场景,实验表明现有模型处理中文多图像场景仍面临挑战,开源与闭源模型存在显著性能差距。
English: RealBench is the first Chinese multimodal multi-image dataset with 9393 samples and 69910 images, featuring real user-generated content and diverse scenes, which challenges existing models and reveals significant performance gaps between open-source and closed-source models.

Authors:Wenyao Li, Ran Zhang, Pengyang Wang, Yuanchun Zhou, Pengfei Wang
Title: Zero-Shot Human Mobility Forecasting via Large Language Model with Hierarchical Reasoning
Abstract:
Human mobility forecasting is important for applications such as transportation planning, urban management, and personalized recommendations. However, existing methods often fail to generalize to unseen users or locations and struggle to capture dynamic intent due to limited labeled data and the complexity of mobility patterns. We propose ZHMF, a framework for zero-shot human mobility forecasting that combines a semantic enhanced retrieval and reflection mechanism with a hierarchical language model based reasoning system. The task is reformulated as a natural language question answering paradigm. Leveraging LLMs semantic understanding of user histories and context, our approach handles previously unseen prediction scenarios. We further introduce a hierarchical reflection mechanism for iterative reasoning and refinement by decomposing forecasting into an activity level planner and a location level selector, enabling collaborative modeling of long term user intentions and short term contextual preferences. Experiments on standard human mobility datasets show that our approach outperforms existing models. Ablation studies reveal the contribution of each module, and case studies illustrate how the method captures user intentions and adapts to diverse contextual scenarios.
中文摘要:提出的ZHMF框架通过结合语义检索与分层语言模型,改进了零样本人类移动预测,能有效处理未知场景并通过迭代推理超越现有方法。
English Summary: The proposed ZHMF framework enhances zero-shot human mobility forecasting by integrating semantic retrieval with a hierarchical language model, effectively handling unseen scenarios through iterative reasoning and outperforming existing methods.

Authors:Xihong Yang, Siwei Wang, Jiaqi Jin, Fangdi Wang, Tianrui Liu, Yueming Jin, Xinwang Liu, En Zhu, Kunlun He
Title: Generalized Deep Multi-view Clustering via Causal Learning with Partially Aligned Cross-view Correspondence
Abstract:
Multi-view clustering (MVC) aims to explore the common clustering structure across multiple views. Many existing MVC methods heavily rely on the assumption of view consistency, where alignments for corresponding samples across different views are ordered in advance. However, real-world scenarios often present a challenge as only partial data is consistently aligned across different views, restricting the overall clustering performance. In this work, we consider the model performance decreasing phenomenon caused by data order shift (i.e., from fully to partially aligned) as a generalized multi-view clustering problem. To tackle this problem, we design a causal multi-view clustering network, termed CauMVC. We adopt a causal modeling approach to understand multi-view clustering procedure. To be specific, we formulate the partially aligned data as an intervention and multi-view clustering with partially aligned data as an post-intervention inference. However, obtaining invariant features directly can be challenging. Thus, we design a Variational Auto-Encoder for causal learning by incorporating an encoder from existing information to estimate the invariant features. Moreover, a decoder is designed to perform the post-intervention inference. Lastly, we design a contrastive regularizer to capture sample correlations. To the best of our knowledge, this paper is the first work to deal generalized multi-view clustering via causal learning. Empirical experiments on both fully and partially aligned data illustrate the strong generalization and effectiveness of CauMVC.
中文: 本文提出CauMVC因果多视图聚类网络,通过将部分对齐数据建模为干预,并利用变分自编码器提取不变特征,有效解决了多视图数据部分对齐时的聚类性能下降问题。
English: This paper introduces CauMVC, a causal multi-view clustering network that addresses the challenge of partially aligned data across views by modeling it as an intervention and using a variational auto-encoder to extract invariant features for improved clustering performance.

Authors:Yaxin Gao, Yao Lu, Zongfei Zhang, Jiaqi Nie, Shanqing Yu, Qi Xuan
Title: DSPC: Dual-Stage Progressive Compression Framework for Efficient Long-Context Reasoning
Abstract:
Large language models (LLMs) have achieved remarkable success in many natural language processing (NLP) tasks. To achieve more accurate output, the prompts used to drive LLMs have become increasingly longer, which incurs higher computational costs. To address this prompt inflation problem, prompt compression has been proposed. However, most existing methods require training a small auxiliary model for compression, incurring a significant amount of additional computation. To avoid this, we propose a two-stage, training-free approach, called Dual-Stage Progressive Compression (DSPC). In the coarse-grained stage, semantic-related sentence filtering removes sentences with low semantic value based on TF-IDF. In the fine-grained stage, token importance is assessed using attention contribution, cross-model loss difference, and positional importance, enabling the pruning of low-utility tokens while preserving semantics. We validate DSPC on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo under a constrained token budget and observe consistent improvements. For instance, in the FewShot task of the Longbench dataset, DSPC achieves a performance of 49.17 by using only 3x fewer tokens, outperforming the best state-of-the-art baseline LongLLMLingua by 7.76.
中文: 针对大语言模型中长提示计算成本高的问题,本研究提出无需训练的双阶段渐进压缩法,通过语义筛选和细粒度剪枝在保留语义的同时显著减少令牌使用,并在实验中实现性能突破。
English: To address the computational inefficiency of lengthy prompts in large language models, this study introduces a training-free, dual-stage compression method that filters low-value sentences and tokens while maintaining semantic integrity, achieving superior performance with significantly reduced token usage.

Authors:Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu
Title: Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs
Abstract:
Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model's pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model's processing pipeline. Finally, we show that through manipulating AENs, we can control LLM's behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.
中文: 研究表明大型语言模型通过少量神经元线性编码问题歧义性,不仅可检测歧义,还能通过调控神经元实现从直接回答到主动弃权的行为控制。
English: This study demonstrates that large language models linearly encode question ambiguity within a small number of neurons, enabling both detection and behavioral control from direct answering to abstention through neuron manipulation.

Authors:Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Title: Image Tokenizer Needs Post-Training
Abstract:
Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there exists a significant discrepancy between the reconstruction and generation distribution, where current tokenizers only prioritize the reconstruction task that happens before generative training without considering the generation errors during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space, and, from which, we propose a novel tokenizer training scheme including both main-training and post-training, focusing on improving latent space construction and decoding respectively. During the main training, a latent perturbation strategy is proposed to simulate sampling noises, \ie, the unexpected tokens generated in generative inference. Specifically, we propose a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer, thus boosting the generation quality and convergence speed, and a novel tokenizer evaluation metric, \ie, pFID, which successfully correlates the tokenizer performance to generation quality. During post-training, we further optimize the tokenizer decoder regarding a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a $\sim$400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments are conducted to broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
中文: 本文提出包含主训练与后训练的新型分词器训练方案,通过模拟生成噪声和优化解码过程,有效解决了图像生成模型中重建与生成分布差异问题,显著提升了生成质量与收敛速度。
English: This paper introduces a novel tokenizer training scheme with main and post-training phases to address the discrepancy between reconstruction and generation distributions in image generative models, significantly improving generation quality and convergence speed.

Authors:Zirui Zheng, Takashi Isobe, Tong Shen, Xu Jia, Jianbin Zhao, Xiaomin Li, Mengmeng Ge, Baolu Li, Qinghe Wang, Dong Li, Dong Zhou, Yunzhi Zhuge, Huchuan Lu, Emad Barsoum
Title: Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking
Abstract:
While autoregressive (AR) models have demonstrated remarkable success in image generation, extending them to layout-conditioned generation remains challenging due to the sparse nature of layout conditions and the risk of feature entanglement. We present Structured Masking for AR-based Layout-to-Image (SMARLI), a novel framework for layoutto-image generation that effectively integrates spatial layout constraints into AR-based image generation. To equip AR model with layout control, a specially designed structured masking strategy is applied to attention computation to govern the interaction among the global prompt, layout, and image tokens. This design prevents mis-association between different regions and their descriptions while enabling sufficient injection of layout constraints into the generation process. To further enhance generation quality and layout accuracy, we incorporate Group Relative Policy Optimization (GRPO) based post-training scheme with specially designed layout reward functions for next-set-based AR models. Experimental results demonstrate that SMARLI is able to seamlessly integrate layout tokens with text and image tokens without compromising generation quality. It achieves superior layoutaware control while maintaining the structural simplicity and generation efficiency of AR models.
中文摘要:SMARLI通过结构化掩码策略将布局约束融入自回归图像生成,在防止特征错误关联的同时,通过优化训练保持了生成质量。
English Summary: SMARLI introduces a structured masking strategy to integrate layout constraints into autoregressive image generation, preventing feature mis-association while maintaining generation quality through optimized training.

Authors:Kevin Wilkinghoff, Haici Yang, Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux
Title: Local Density-Based Anomaly Score Normalization for Domain Generalization
Abstract:
State-of-the-art anomalous sound detection (ASD) systems in domain-shifted conditions rely on projecting audio signals into an embedding space and using distance-based outlier detection to compute anomaly scores. One of the major difficulties to overcome is the so-called domain mismatch between the anomaly score distributions of a source domain and a target domain that differ acoustically and in terms of the amount of training data provided. A decision threshold that is optimal for one domain may be highly sub-optimal for the other domain and vice versa. This significantly degrades the performance when only using a single decision threshold, as is required when generalizing to multiple data domains that are possibly unseen during training while still using the same trained ASD system as in the source domain. To reduce this mismatch between the domains, we propose a simple local-density-based anomaly score normalization scheme. In experiments conducted on several ASD datasets, we show that the proposed normalization scheme consistently improves performance for various types of embedding-based ASD systems and yields better results than existing anomaly score normalization approaches.
中文: 提出的基于局部密度的异常分数归一化方法有效减轻了异常声音检测中的领域不匹配问题,显著提升了多种嵌入式系统的性能,并优于现有的归一化技术。
English: The proposed local-density-based anomaly score normalization effectively reduces domain mismatch in anomalous sound detection systems, consistently enhancing performance across various embedding-based approaches and outperforming existing normalization methods.

Authors:Jiahao Luo, Chaoyang Wang, Michael Vasilkovsky, Vladislav Shakhrai, Di Liu, Peiye Zhuang, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee, James Davis, Jian Wang
Title: T2Bs: Text-to-Character Blendshapes via Video Generation
Abstract:
We present T2Bs, a framework for generating high-quality, animatable character head morphable models from text by combining static text-to-3D generation with video diffusion. Text-to-3D models produce detailed static geometry but lack motion synthesis, while video diffusion models generate motion with temporal and multi-view geometric inconsistencies. T2Bs bridges this gap by leveraging deformable 3D Gaussian splatting to align static 3D assets with video outputs. By constraining motion with static geometry and employing a view-dependent deformation MLP, T2Bs (i) outperforms existing 4D generation methods in accuracy and expressiveness while reducing video artifacts and view inconsistencies, and (ii) reconstructs smooth, coherent, fully registered 3D geometries designed to scale for building morphable models with diverse, realistic facial motions. This enables synthesizing expressive, animatable character heads that surpass current 4D generation techniques.
Chinese: T2Bs框架通过结合静态文本到3D生成与视频扩散技术,构建高质量可动画角色头部形变模型,有效解决了运动合成不足和几何不一致问题,实现了超越现有4D生成方法的生动表现效果。
English: T2Bs is a framework that integrates static text-to-3D generation with video diffusion to create high-quality, animatable character head morphable models, overcoming motion synthesis limitations and geometric inconsistencies to produce expressive, artifact-free results.

Authors:Jiahao Luo, Chaoyang Wang, Michael Vasilkovsky, Vladislav Shakhrai, Di Liu, Peiye Zhuang, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee, James Davis, Jian Wang
Title: T2Bs: Text-to-Character Blendshapes via Video Generation
Abstract:
We present T2Bs, a framework for generating high-quality, animatable character head morphable models from text by combining static text-to-3D generation with video diffusion. Text-to-3D models produce detailed static geometry but lack motion synthesis, while video diffusion models generate motion with temporal and multi-view geometric inconsistencies. T2Bs bridges this gap by leveraging deformable 3D Gaussian splatting to align static 3D assets with video outputs. By constraining motion with static geometry and employing a view-dependent deformation MLP, T2Bs (i) outperforms existing 4D generation methods in accuracy and expressiveness while reducing video artifacts and view inconsistencies, and (ii) reconstructs smooth, coherent, fully registered 3D geometries designed to scale for building morphable models with diverse, realistic facial motions. This enables synthesizing expressive, animatable character heads that surpass current 4D generation techniques.
Chinese: T2Bs框架通过结合静态文本到3D生成与视频扩散技术,构建高质量可动画角色头部形变模型,有效解决了运动合成不足和几何不一致问题,实现了超越现有4D生成方法的生动表现效果。
English: T2Bs is a framework that integrates static text-to-3D generation with video diffusion to create high-quality, animatable character head morphable models, overcoming motion synthesis limitations and geometric inconsistencies to produce expressive, artifact-free results.

Authors:Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
Title: VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Abstract:
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona following natural language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human centered spoken interaction. The dataset and code are publicly available at \href{https://junzhan2000.github.io/VStyle.github.io/}{project's homepage}.
中文: 本文提出了语音风格适应(VSA)这一新任务,旨在评估口语模型根据语音指令调整说话风格的能力,并发布了VStyle双语基准和LALM评估框架,揭示了当前模型在此任务上的明显局限。
English: This paper introduces Voice Style Adaptation (VSA), a new task for spoken language models to modify speaking styles based on spoken commands, and presents the VStyle benchmark and LALM as a Judge framework to evaluate current models' limitations in this area.

Authors:Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
Title: Steering MoE LLMs via Expert (De)Activation
Abstract:
Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.
中文摘要:SteerMoE框架通过检测并选择性激活与特定行为相关的专家,可在不重新训练的情况下将大型语言模型的安全性和忠实度分别提升最高20%和27%,同时也揭示了专家系统中隐藏的对抗性攻击漏洞,能完全绕过安全防护。
English Summary: SteerMoE is a framework that controls behaviors in Mixture-of-Experts language models by detecting and selectively activating experts linked to specific traits, enhancing safety and faithfulness by up to 27% without retraining, while also revealing vulnerabilities to adversarial attacks that can bypass safety measures.

Authors:Rui Xu, Yinghui Ye, Xiaoli Chu, Guangyue Lu, Farshad Rostami Ghadi, Kai-Kit Wong
Title: Gaussian Copula-Based Outage Performance Analysis of Fluid Antenna Systems: Channel Coefficient- or Envelope-Level Correlation Matrix?
Abstract:
Gaussian copula has been employed to evaluate the outage performance of Fluid Antenna Systems (FAS), with the covariance matrix reflecting the dependence among multivariate normal random variables (RVs). While prior studies approximate this matrix using the channel coefficient correlation matrix from Jake's model, this work instead employs the channel envelope correlation matrix, motivated by the fact that the multivariate normal RVs are generated by transforming correlated channel envelopes. This raises an open question of whether using the coefficient- or envelope-level correlation matrix yields better accuracy in accessing FAS performance. Toward this end, this paper explores the benefits of using the envelope-level correlation matrix under fully correlated Nakagami-m fading, and develops a method for generating such fading channels for Monte Carlo simulations, which serve as a benchmark for validating the theoretical results. Simulation results confirm the effectiveness of the proposed channel modeling approach and demonstrate the superior accuracy of using the envelope-level correlation matrix, particularly in sparse port deployment and low-outage regime.
中文摘要:本文研究表明,在评估流体天线系统性能时,采用包络级相关矩阵比系数级相关矩阵具有更高精度,尤其在稀疏端口部署和低中断概率场景下表现更为优越。
English Summary: This paper demonstrates that using the envelope-level correlation matrix provides superior accuracy for evaluating Fluid Antenna System performance, especially in sparse port deployments and low-outage scenarios, compared to the coefficient-level matrix approximation.

Authors:Kechen Jiao, Zhirui Fang, Jiahao Liu, Bei Li, Qifan Wang, Xinyu Liu, Junhao Ruan, Zhongjian Qiao, Yifan Zhu, Yaxin Xu, Jingang Wang, Xiu Li
Title: TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making
Abstract:
Using effective generalization capabilities of vision language models (VLMs) in context-specific dynamic tasks for embodied artificial intelligence remains a significant challenge. Although supervised fine-tuned models can better align with the real physical world, they still exhibit sluggish responses and hallucination issues in dynamically changing environments, necessitating further alignment. Existing post-SFT methods, reliant on reinforcement learning and chain-of-thought (CoT) approaches, are constrained by sparse rewards and action-only optimization, resulting in low sample efficiency, poor consistency, and model degradation. To address these issues, this paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. Specifically, TCPO introduces a stepwise preference-based optimization approach, transforming sparse reward signals into richer step sample pairs. It emphasizes the alignment of the model's intermediate reasoning process, mitigating the problem of model degradation. Moreover, by incorporating Action Policy Consistency Constraint (APC), it further imposes consistency constraints on the model output. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM and validating the effectiveness of our approach in mitigating model degradation after fine-tuning. These results highlight the potential of integrating preference-based learning techniques with CoT processes to enhance the decision-making capabilities of vision-language models in embodied agents.
中文: 本文提出思维中心偏好优化方法,通过对齐中间推理过程并施加动作一致性约束来增强视觉语言模型在具身决策中的能力,实验实现了6%的性能提升。
English: This paper introduces Thought-Centric Preference Optimization (TCPO) to enhance embodied decision-making in vision-language models by aligning intermediate reasoning processes and applying action consistency constraints, achieving a 6% performance improvement in experiments.

Authors:Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Zilong Dong, Yike Guo
Title: PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image
Abstract:
We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework towards existing work.
中文: 本文提出了一种前馈框架,能够从单张无姿态图像中高效合成高斯全头模型,通过单次前向传播实现快速重建与渲染,并采用由粗到精的生成流程及双分支特征聚合策略,有效提升了重建质量。
English: This paper introduces a feed-forward framework for efficiently synthesizing a Gaussian full-head model from a single unposed image in one forward pass, eliminating the need for GAN inversion or test-time optimization, and utilizes a coarse-to-fine pipeline with a dual-branch feature aggregation approach for high-fidelity reconstruction.

Authors:Yuxuan Bai, Yuxuan Sun, Tan Chen, Wei Chen, Sheng Zhou, Zhisheng Niu
Title: FedTeddi: Temporal Drift and Divergence Aware Scheduling for Timely Federated Edge Learning
Abstract:
Federated edge learning (FEEL) enables collaborative model training across distributed clients over wireless networks without exposing raw data. While most existing studies assume static datasets, in real-world scenarios clients may continuously collect data with time-varying and non-independent and identically distributed (non-i.i.d.) characteristics. A critical challenge is how to adapt models in a timely yet efficient manner to such evolving data. In this paper, we propose FedTeddi, a temporal-drift-and-divergence-aware scheduling algorithm that facilitates fast convergence of FEEL under dynamic data evolution and communication resource limits. We first quantify the temporal dynamics and non-i.i.d. characteristics of data using temporal drift and collective divergence, respectively, and represent them as the Earth Mover's Distance (EMD) of class distributions for classification tasks. We then propose a novel optimization objective and develop a joint scheduling and bandwidth allocation algorithm, enabling the FEEL system to learn from new data quickly without forgetting previous knowledge. Experimental results show that our algorithm achieves higher test accuracy and faster convergence compared to benchmark methods, improving the rate of convergence by 58.4% on CIFAR-10 and 49.2% on CIFAR-100 compared to random scheduling.
Chinese: FedTeddi是一种调度算法,通过时间漂移和差异度量有效管理动态非独立同分布数据演化,从而提升联邦边缘学习的收敛速度和测试精度。
English: FedTeddi is a scheduling algorithm that enhances federated edge learning by efficiently managing dynamic and non-i.i.d. data evolution through temporal drift and divergence metrics, achieving faster convergence and higher accuracy.

Authors:Filippo Bragato, Tullia Fontana, Marco Giordani, Malte Schellmann, Josef Eichinger, Michele Zorzi
Title: Network-Aware Control of AGVs in an Industrial Scenario: A Simulation Study Based on ROS 2 and Gazebo
Abstract:
Networked Control System (NCS) is a paradigm where sensors, controllers, and actuators communicate over a shared network. One promising application of NCS is the control of Automated Guided Vehicles (AGVs) in the industrial environment, for example to transport goods efficiently and to autonomously follow predefined paths or routes. In this context, communication and control are tightly correlated, a paradigm referred to as Joint Communication and Control (JCC), since network issues such as delays or errors can lead to significant deviations of the AGVs from the planned trajectory. In this paper, we present a simulation framework based on Gazebo and Robot Operating System 2 (ROS 2) to simulate and visualize, respectively, the complex interaction between the control of AGVs and the underlying communication network. This framework explicitly incorporates communication metrics, such as delay and packet loss, and control metrics, especially the Mean Squared Error (MSE) between the optimal/desired and actual path of the AGV in response to driving commands. Our results shed light into the correlation between the network performance, particularly Packet Reception Ratio (PRR), and accuracy of control.
中文摘要:本文提出基于Gazebo和ROS2的仿真框架,用于模拟工业自动导引车系统中通信与控制的内在关联,揭示了网络数据包接收率等性能指标对车辆路径跟踪精度的直接影响。
English Summary: This paper introduces a simulation framework using Gazebo and ROS 2 to model the joint communication and control interactions in networked AGV systems, demonstrating how network performance metrics like packet loss directly impact vehicle path accuracy.

Authors:Chengkai Xu, Jiaqi Liu, Yicheng Guo, Peng Hang, Jian Sun
Title: A Knowledge-Driven Diffusion Policy for End-to-End Autonomous Driving Based on Expert Routing
Abstract:
End-to-end autonomous driving remains constrained by the need to generate multi-modal actions, maintain temporal stability, and generalize across diverse scenarios. Existing methods often collapse multi-modality, struggle with long-horizon consistency, or lack modular adaptability. This paper presents KDP, a knowledge-driven diffusion policy that integrates generative diffusion modeling with a sparse mixture-of-experts routing mechanism. The diffusion component generates temporally coherent and multi-modal action sequences, while the expert routing mechanism activates specialized and reusable experts according to context, enabling modular knowledge composition. Extensive experiments across representative driving scenarios demonstrate that KDP achieves consistently higher success rates, reduced collision risk, and smoother control compared to prevailing paradigms. Ablation studies highlight the effectiveness of sparse expert activation and the Transformer backbone, and activation analyses reveal structured specialization and cross-scenario reuse of experts. These results establish diffusion with expert routing as a scalable and interpretable paradigm for knowledge-driven end-to-end autonomous driving.
中文: 本文提出KDP,一种知识驱动的扩散策略,通过结合生成扩散模型与稀疏专家混合路由机制,提升了自动驾驶的适应性决策和泛化能力,在多种场景下实现了更高的安全性和控制流畅性。
English: The paper introduces KDP, a knowledge-driven diffusion policy that combines generative diffusion modeling with a sparse mixture-of-experts routing to enhance adaptive decision-making and generalization in autonomous driving, achieving superior performance in safety and control across diverse scenarios.

Authors:Chengkai Xu, Jiaqi Liu, Yicheng Guo, Peng Hang, Jian Sun
Title: A Knowledge-Driven Diffusion Policy for End-to-End Autonomous Driving Based on Expert Routing
Abstract:
End-to-end autonomous driving remains constrained by the difficulty of producing adaptive, robust, and interpretable decision-making across diverse scenarios. Existing methods often collapse diverse driving behaviors, lack long-horizon consistency, or require task-specific engineering that limits generalization. This paper presents KDP, a knowledge-driven diffusion policy that integrates generative diffusion modeling with a sparse mixture-of-experts routing mechanism. The diffusion component generates temporally coherent action sequences, while the expert routing mechanism activates specialized and reusable experts according to context, enabling modular knowledge composition. Extensive experiments across representative driving scenarios demonstrate that KDP achieves consistently higher success rates, reduced collision risk, and smoother control compared to prevailing paradigms. Ablation studies highlight the effectiveness of sparse expert activation and the Transformer backbone, and activation analyses reveal structured specialization and cross-scenario reuse of experts. These results establish diffusion with expert routing as a scalable and interpretable paradigm for knowledge-driven end-to-end autonomous driving.
中文: 本文提出KDP,一种知识驱动的扩散策略,通过结合生成扩散模型与稀疏专家混合路由机制,提升了自动驾驶的适应性决策和泛化能力,在多种场景下实现了更高的安全性和控制流畅性。
English: The paper introduces KDP, a knowledge-driven diffusion policy that combines generative diffusion modeling with a sparse mixture-of-experts routing to enhance adaptive decision-making and generalization in autonomous driving, achieving superior performance in safety and control across diverse scenarios.

Authors:Shengyin Sun, Yiming Li, Xing Li, Yingzhao Lian, Weizhe Lin, Hui-Ling Zhen, Zhiyuan Yang, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma
Title: Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling
Abstract:
Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
中文摘要:测试时扩展虽能增强大语言模型的推理能力,但会因重复计算导致效率低下,而结合n-元文法推测解码方法可有效捕捉重复模式,在加速推理的同时兼顾多样性路径的平衡处理。
English Summary: Test-time scaling enhances LLM reasoning but suffers from inefficiency due to redundant computations, which can be effectively accelerated by integrating n-gram-based speculative decoding methods that capture repetitive patterns while balancing diverse reasoning paths.

Authors:Jimin Xu, Bosheng Qin, Tao Jin, Zhou Zhao, Zhenhui Ye, Jun Yu, Fei Wu
Title: SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer
Abstract:
Recent advancements in neural representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have increased interest in applying style transfer to 3D scenes. While existing methods can transfer style patterns onto 3D-consistent neural representations, they struggle to effectively extract and transfer high-level style semantics from the reference style image. Additionally, the stylized results often lack structural clarity and separation, making it difficult to distinguish between different instances or objects within the 3D scene. To address these limitations, we propose a novel 3D style transfer pipeline that effectively integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. Then, we transfer the stylized key views onto the 3D representation. This process incorporates two innovative designs. The first is cross-view style alignment, which inserts cross-view attention into the last upsampling block of the UNet, allowing feature interactions across multiple key views. This ensures that the diffusion model generates stylized key views that maintain both style fidelity and instance-level consistency. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation. This results in a more structured, visually coherent, and artistically enriched stylization. Extensive qualitative and quantitative experiments demonstrate that our 3D style transfer pipeline significantly outperforms state-of-the-art methods across a wide range of scenes, from forward-facing to challenging 360-degree environments. Visit our project page https://jm-xu.github.io/SSGaussian for immersive visualization.
Chinese: 针对现有三维风格迁移方法难以提取高级语义和保持结构清晰的问题,本研究提出创新流程,通过扩散先验的跨视角对齐和实例级迁移技术,实现了在多种场景中显著优化的风格化效果。
English: Recent advances in 3D style transfer struggle with extracting high-level semantics and maintaining structural clarity, prompting the development of a novel pipeline that leverages diffusion priors through cross-view alignment and instance-level transfer to achieve superior stylization.

Authors:Amirmohammad Farzaneh, Matteo Zecchin, Osvaldo Simeone
Title: Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference
Abstract:
This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.
中文摘要:本文提出SP-CCI新框架,通过引入合成数据改进保形反事实推理,在保持边际覆盖保证的同时生成更紧凑的预测区间。
English Summary: This paper introduces SP-CCI, a novel framework that enhances conformal counterfactual inference by incorporating synthetic data to generate tighter prediction intervals while maintaining marginal coverage guarantees.

Authors:Alessandro Traspadini, Michele Zorzi, Marco Giordani
Title: Performance Evaluation of LoRa for IoT Applications in Non-Terrestrial Networks via ns-3
Abstract:
The integration of Internet of Things (IoT) and Non-Terrestrial Networks (NTNs) has emerged as a key paradigm to provide connectivity for sensors and actuators via satellite gateways in remote areas where terrestrial infrastructure is limited or unavailable. Among other Low-Power Wide-Area Network (LPWAN) technologies for IoT, Long Range (LoRa) holds great potential given its long range, energy efficiency, and flexibility. In this paper, we explore the feasibility and performance of LoRa to support large-scale IoT connectivity through Low Earth Orbit (LEO) satellite gateways. To do so, we developed a new ns3-LoRa-NTN simulation module, which integrates and extends the ns3-LoRa and ns3-NTN modules, to enable full-stack end-to-end simulation of satellite communication in LoRa networks. Our results, given in terms of average data rate and Packet Reception Ratio (PRR), confirm that LoRa can effectively support direct communication from the ground to LEO satellites, but network optimization is required to mitigate collision probability when end nodes use the same Spreading Factors (SFs) over long distances.
中文摘要:研究表明,LoRa技术能够有效支持地面设备与低轨卫星间的直接通信以实现大规模物联网连接,但在长距离使用相同扩频因子时需进行网络优化以降低冲突概率。
English Summary: This study demonstrates that LoRa technology can effectively enable direct communication between ground devices and LEO satellites for large-scale IoT connectivity, though network optimization is necessary to reduce collision risks when using identical Spreading Factors over long distances.

Authors:Chen Sun, Wenqi Zhang, Bizhu Wang, Xiaodong Xu, Chau Yuen, Yan Zhang, Ping Zhang
Title: Know What, Know Why: Semantic Hazard Communication for Intelligent V2X Systems
Abstract:
In current vehicle-to-everything (V2X) communication systems, roadside units (RSUs) broadcast brief warning messages that alert nearby vehicles to avoid potential hazards. However, these messages lack contextual information on why a warning is issued, leading to excessive caution or inefficient driving behaviors. To avoid such a situation, we propose a semantic-enhanced and explainable V2X (SEE-V2X) system. In the proposed system, RSUs equipped with smart cameras detect obstructions and transmit context-aware messages to vehicles. By understanding both what the hazard is and why it occurs, drivers can make more intelligent decisions based on their specific driving situation. Furthermore, through a real-field demonstration, we show the new "see-through" feature in the proposed system, which enables drivers to visualize hidden pedestrians behind obstacles. We also perform simulations to compare traditional V2X with SEE-V2X under different traffic conditions. The results show that SEE-V2X significantly improves traffic efficiency and reduces unnecessary deceleration.
中文: 提出的SEE-V2X系统通过路边单元搭载的智能摄像头提供情境感知预警,使驾驶员能够可视化隐藏危险并做出明智决策,从而显著提升交通效率并减少不必要的减速行为。
English: The proposed SEE-V2X system enhances traditional V2X communication by providing context-aware warnings through smart cameras on RSUs, enabling drivers to visualize hidden hazards and make informed decisions, which significantly improves traffic efficiency and reduces unnecessary braking.

Authors:Tianyue Zheng, Jiajia Guo, Linglong Dai, Shi Jin, Jun Zhang
Title: MUSE-FM: Multi-task Environment-aware Foundation Model for Wireless Communications
Abstract:
Recent advancements in foundation models (FMs) have attracted increasing attention in the wireless communication domain. Leveraging the powerful multi-task learning capability, FMs hold the promise of unifying multiple tasks of wireless communication with a single framework. with a single framework. Nevertheless, existing wireless FMs face limitations in the uniformity to address multiple tasks with diverse inputs/outputs across different communication scenarios.In this paper, we propose a MUlti-taSk Environment-aware FM (MUSE-FM) with a unified architecture to handle multiple tasks in wireless communications, while effectively incorporating scenario information.Specifically, to achieve task uniformity, we propose a unified prompt-guided data encoder-decoder pair to handle data with heterogeneous formats and distributions across different tasks. Besides, we integrate the environmental context as a multi-modal input, which serves as prior knowledge of environment and channel distributions and facilitates cross-scenario feature extraction. Simulation results illustrate that the proposed MUSE-FM outperforms existing methods for various tasks, and its prompt-guided encoder-decoder pair improves the scalability for new task configurations. Moreover, the incorporation of environment information improves the ability to adapt to different scenarios.
基础模型在无线通信领域展现出统一多任务的潜力,但现有方法难以处理跨场景的异构数据,因此提出MUSE-FM框架,通过统一提示引导架构和环境信息融合来提升任务性能与场景适应能力。
Foundation models (FMs) show potential for unifying wireless communication tasks but face limitations in handling diverse inputs and outputs across scenarios, leading to the proposal of MUSE-FM with a unified prompt-guided architecture and environmental context integration to improve performance and adaptability.

Authors:Xiucheng Wang, Qiming Zhang, Nan Cheng
Title: RadioDiff-Loc: Diffusion Model Enhanced Scattering Congnition for NLoS Localization with Sparse Radio Map Estimation
Abstract:
Accurate localization of non-cooperative signal sources in non-line-of-sight (NLoS) environments remains a critical challenge with a wide range of applications, including autonomous navigation, industrial automation, and emergency response. In such settings, traditional positioning techniques relying on line-of-sight (LoS) or cooperative signaling fail due to severe multipath propagation and unknown transmit power. This paper proposes a novel generative inference framework for NLoS localization based on conditional diffusion models. By leveraging the physical insight that diffracted electromagnetic energy concentrates near building edges, we develop a sampling strategy that collects sparse received signal strength (RSS) measurements at the geometric vertices of obstacles--locations that maximize Fisher information and mutual information with respect to the unknown source. To overcome the lack of known transmission power, we normalize all sampled RSS values relative to the maximum observed intensity, enabling the construction of a power-invariant radio map (RM). A conditional diffusion model is trained to reconstruct the full RM based on environmental layout and sparse RSS observations. Localization is then achieved by identifying the brightest point on the generated RM. Moreover, the proposed framework is compatible with existing RSS-based localization algorithms, enabling a dual-driven paradigm that fuses physical knowledge and data-driven inference for improved accuracy. Extensive theoretical analysis and empirical validation demonstrate that our approach achieves high localization accuracy with significantly reduced sampling cost, offering a scalable and physically grounded solution for non-cooperative NLoS emitter localization.
中文摘要:本文提出了一种基于条件扩散模型的生成式推理框架,通过利用建筑物边缘电磁衍射的物理特性采集稀疏RSS测量值,实现了非视距环境下非合作信号源的精准定位。
English Summary: This paper introduces a generative inference framework using conditional diffusion models to accurately localize non-cooperative signal sources in NLoS environments by leveraging sparse RSS measurements and physical knowledge of electromagnetic diffraction near building edges.

Authors:Xuhui Zhou, Jiarui Liu, Akhila Yerukola, Hyunwoo Kim, Maarten Sap
Title: Social World Models
Abstract:
Humans intuitively navigate social interactions by simulating unspoken dynamics and reasoning about others' perspectives, even with limited information. In contrast, AI systems struggle to automatically structure and reason about these implicit social contexts. In this paper, we introduce a novel structured social world representation formalism (S3AP), designed to help AI systems reason more effectively about social dynamics. Following a POMDP-driven design, S3AP represents social interactions as structured tuples, such as state, observation, agent actions, and mental states, which can be automatically induced from free-form narratives or other inputs. We first show S3AP can help LLMs better understand social narratives across 5 social reasoning tasks (e.g., +51% improvement on FANToM's theory-of-mind reasoning with OpenAI's o1), reaching new state-of-the-art (SOTA) performance. We then induce social world models from these structured representations, demonstrating their ability to predict future social dynamics and improve agent decision-making, yielding up to +18% improvement on the SOTOPIA social interaction benchmark. Our findings highlight the promise of S3AP as a powerful, general-purpose representation for social world states, enabling the development of more socially-aware systems that better navigate social interactions.
中文: 本文提出的S3AP结构化社交世界表征能有效提升AI对社交动态的推理能力,在多项社交推理任务中达到最优性能,并显著增强了智能体在社交互动中的决策表现。
English: This paper introduces S3AP, a structured social world representation that enhances AI's ability to reason about social dynamics, achieving state-of-the-art performance in social reasoning tasks and improving agent decision-making in social interactions.

Authors:Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, Ram Ramrakhya, Chao Jia, Jeffrey Nichols, Alexander Toshev, Yinfei Yang, Zhe Gan
Title: Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents
Abstract:
Developing autonomous agents that effectively interact with Graphic User Interfaces (GUIs) remains a challenging open problem, especially for small on-device models. In this paper, we present Ferret-UI Lite, a compact, end-to-end GUI agent that operates across diverse platforms, including mobile, web, and desktop. Utilizing techniques optimized for developing small models, we build our 3B Ferret-UI Lite agent through curating a diverse GUI data mixture from real and synthetic sources, strengthening inference-time performance through chain-of-thought reasoning and visual tool-use, and reinforcement learning with designed rewards. Ferret-UI Lite achieves competitive performance with other small-scale GUI agents. In GUI grounding, Ferret-UI Lite attains scores of $91.6\%$, $53.3\%$, and $61.2\%$ on the ScreenSpot-V2, ScreenSpot-Pro, and OSWorld-G benchmarks, respectively. For GUI navigation, Ferret-UI Lite achieves success rates of $28.0\%$ on AndroidWorld and $19.8\%$ on OSWorld. We share our methods and lessons learned from developing compact, on-device GUI agents.
中文:Ferret-UI Lite是一款紧凑型3B参数GUI代理,通过优化训练技术和专用推理方法在多个平台上实现优异性能,在GUI定位和导航基准测试中均展现出强劲表现。
English: Ferret-UI Lite is a compact 3B model GUI agent that achieves competitive performance across multiple platforms through optimized training techniques and specialized inference methods, demonstrating strong results in GUI grounding and navigation benchmarks.

Authors:Xue Yan, Zijing Ou, Mengyue Yang, Yan Song, Haifeng Zhang, Yingzhen Li, Jun Wang
Title: Memory-Driven Self-Improvement for Decision Making with Large Language Models
Abstract:
Large language models (LLMs) have emerged as effective action policies for sequential decision-making (SDM) tasks due to their extensive prior knowledge. However, this broad yet general knowledge is often insufficient for specific decision-making tasks with limited task-related data, making it challenging to efficiently adapt LLMs to specific SDM tasks. To address this challenge, we propose a memory-driven self-improvement framework that combines LLM general prior knowledge with a compact memory of domain-specific experiences. Memory retains past interactions and associated Q-values, thereby capturing decision-relevant knowledge that facilitates accurate value estimation and informs the LLM prior refinement. The refined LLM prior, in turn, generates higher-reward trajectories that further enrich memory, forming a natural self-improvement framework where memory and LLM prior mutually reinforce each other. Experiments show that our memory-driven approach significantly outperforms both traditional RL and LLM-based baselines, e.g., improving performance by over 40\% on in-distribution tasks and over 75\% when generalized to unseen tasks in ALFWorld.
中文摘要:该研究提出了一种记忆驱动的自我改进框架,通过将大语言模型的通用知识与存储领域特定经验的记忆相结合,显著提升了序列决策任务的性能表现,在各类测试中优于传统强化学习和基于大语言模型的基线方法。
English Summary: The proposed memory-driven self-improvement framework enhances large language models for sequential decision-making by integrating their general knowledge with domain-specific experiences stored in memory, achieving significant performance improvements over traditional methods.

Authors:Teng Zhang, Ziqian Fan, Mingxin Liu, Xin Zhang, Xudong Lu, Wentong Li, Yue Zhou, Yi Yu, Xiang Li, Junchi Yan, Xue Yang
Title: Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization
Abstract:
Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM's poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.
中文: Point2RBox-v3通过渐进式标签分配和先验引导动态掩码损失,有效提升了伪标签的质量和利用率,在多类遥感数据集的定向目标检测中取得了领先性能。
English: Point2RBox-v3 introduces progressive label assignment and prior-guided dynamic mask loss to enhance pseudo label quality and utilization, achieving state-of-the-art performance in oriented object detection across diverse datasets.

Authors:Takumi Goto, Yusuke Sakai, Taro Watanabe
Title: Reliability Crisis of Reference-free Metrics for Grammatical Error Correction
Abstract:
Reference-free evaluation metrics for grammatical error correction (GEC) have achieved high correlation with human judgments. However, these metrics are not designed to evaluate adversarial systems that aim to obtain unjustifiably high scores. The existence of such systems undermines the reliability of automatic evaluation, as it can mislead users in selecting appropriate GEC systems. In this study, we propose adversarial attack strategies for four reference-free metrics: SOME, Scribendi, IMPARA, and LLM-based metrics, and demonstrate that our adversarial systems outperform the current state-of-the-art. These findings highlight the need for more robust evaluation methods.
Chinese: 无参考语法纠错评估指标虽与人工评判高度相关,但易受对抗系统人为刷分的影响,这凸显了开发更强健评估方法的迫切需求。
English: Reference-free GEC metrics, while correlating well with human judgment, are vulnerable to adversarial systems that artificially inflate scores, revealing a critical need for more robust evaluation methods.

Authors:Peini Yi, Wenchi Cheng, Jingqing Wang, Jinzhe Pan, Yuehui Ouyang, Wei Zhang
Title: Intelligent Multi-link EDCA Optimization for Delay-Bounded QoS in Wi-Fi 7
Abstract:
IEEE 802.11be (Wi-Fi 7) introduces Multi-Link Operation (MLO) as a While MLO offers significant parallelism and capacity, realizing its full potential in guaranteeing strict delay bounds and optimizing Quality of Service (QoS) for diverse, heterogeneous traffic streams in complex multi-link scenarios remain a significant challenge. This is largely due to the limitations of static Enhanced Distributed Channel Access (EDCA) parameters and the complexity inherent in cross-link traffic management. To address this, this paper investigates the correlation between overall MLO QoS indicators and the configuration of EDCA parameters and Acess Catagory (AC) traffic allocation among links. Based on this analysis, we formulate a constrained optimization problem aiming to minimize the sum of overall packet loss rates for all access categories while satisfying their respective overall delay violation probability constraints. A Genetic Algorithm (GA)-based MLO EDCA QoS optimization algorithm is designed to efficiently search the complex configuration space of AC assignments and EDCA parameters. Experimental results demonstrate that the proposed approach's efficacy in generating adaptive MLO configuration strategies that align with diverse service requirements. The proposed solution significantly improves delay distribution characteristics, and enhance QoS robustness and resource utilization efficiency in high-load MLO environments.
中文: 本文针对Wi-Fi 7多链路操作在复杂场景下保障服务质量面临的挑战,通过遗传算法优化EDCA参数和业务分配,有效改善了时延特性并提升了高负载环境下的资源利用效率。
English: IEEE 802.11be Wi-Fi 7's Multi-Link Operation faces challenges in ensuring strict QoS for diverse traffic, which this paper addresses by optimizing EDCA parameters and traffic allocation using a Genetic Algorithm to improve delay performance and resource utilization.

Authors:Yuxuan Wan, Tingshuo Liang, Jiakai Xu, Jingyu Xiao, Yintong Huo, Michael R. Lyu
Title: Automatically Generating Web Applications from Requirements Via Multi-Agent Test-Driven Development
Abstract:
Developing full-stack web applications is complex and time-intensive, demanding proficiency across diverse technologies and frameworks. Although recent advances in multimodal large language models (MLLMs) enable automated webpage generation from visual inputs, current solutions remain limited to front-end tasks and fail to deliver fully functional applications. In this work, we introduce TDDev, the first test-driven development (TDD)-enabled LLM-agent framework for end-to-end full-stack web application generation. Given a natural language description or design image, TDDev automatically derives executable test cases, generates front-end and back-end code, simulates user interactions, and iteratively refines the implementation until all requirements are satisfied. Our framework addresses key challenges in full-stack automation, including underspecified user requirements, complex interdependencies among multiple files, and the need for both functional correctness and visual fidelity. Through extensive experiments on diverse application scenarios, TDDev achieves a 14.4% improvement on overall accuracy compared to state-of-the-art baselines, demonstrating its effectiveness in producing reliable, high-quality web applications without requiring manual intervention.
中文: TDDev首创了测试驱动的LLM智能体框架,能够根据自然语言或设计图自动生成全栈网页应用,通过迭代测试与优化实现了比现有方法高14.4%的整体准确率。
English: TDDev introduces a test-driven LLM-agent framework that automatically generates complete full-stack web applications from natural language or images, achieving 14.4% higher accuracy than existing methods through iterative testing and refinement.

Authors:Penghao Wu, Yushan Zhang, Haiwen Diao, Bo Li, Lewei Lu, Ziwei Liu
Title: Visual Jigsaw Post-Training Improves MLLMs
Abstract:
Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs' intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. There exist a few approaches in this direction, however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs. Project Page: https://penghao-wu.github.io/visual_jigsaw/
中文摘要:本文提出Visual Jigsaw这一自监督强化学习框架,通过让多模态大语言模型用自然语言重构被打乱的视觉序列来增强其视觉理解能力,在多种视觉模态上显著提升了细粒度感知、时序推理和空间理解能力。
English Summary: This paper introduces Visual Jigsaw, a self-supervised reinforcement learning framework that enhances multimodal language models' visual understanding by having them reconstruct shuffled visual sequences through natural language permutations, demonstrating significant improvements in perception and reasoning across multiple visual modalities.

Authors:Yunyang Ge, Xinhua Cheng, Chengshu Zhao, Xianyi He, Shenghai Yuan, Bin Lin, Bin Zhu, Li Yuan
Title: FlashI2V: Fourier-Guided Latent Shifting Prevents Conditional Image Leakage in Image-to-Video Generation
Abstract:
In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P. Github page: https://pku-yuangroup.github.io/FlashI2V/
中文: FlashI2V通过潜在偏移和傅里叶引导解决了图像到视频生成中的条件图像泄漏问题,仅用13亿参数就在域外数据上实现了最优的泛化能力和性能表现。
English: FlashI2V addresses conditional image leakage in Image-to-Video generation by introducing Latent Shifting and Fourier Guidance, achieving superior generalization and performance on out-of-domain data with only 1.3B parameters.

Authors:Yuxian Jiang, Yafu Li, Guanxu Chen, Dongrui Liu, Yu Cheng, Jing Shao
Title: Rethinking Entropy Regularization in Large Reasoning Models
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has shown great promise in enhancing the reasoning abilities of large reasoning models (LRMs). However, it suffers from a critical issue: entropy collapse and premature convergence. Naive entropy regularization, a common approach for encouraging exploration in the traditional RL literature, fails to address this problem in the context of LRM. Our analysis reveals that this failure stems from the vast action space and long trajectories in LRMs, which easily trigger a global entropy explosion as the model indiscriminately explores all possible actions and states. To address this, we propose SIREN (SelectIve entRopy rEgularizatioN), a method that confines exploration to a meaningful subset of actions and states. SIREN achieves this through a two-step entropy masking mechanism, consisting of a top-p mask and a peak-entropy mask. In addition, regularization is transformed into a self-anchored form to stabilize training. Across five mathematical benchmarks, SIREN attains superior average performance over previous entropy-related RLVR approaches, exemplified by a +6.6 maj@k improvement on AIME24/25 with Qwen2.5-Math-7B. Further analysis confirms that SIREN promotes greater response diversity and maintains entropy at an appropriate level, which helps to preserve the validation pass@k throughout training. This effectively mitigates the premature convergence problem common in RLVR for LRM.
中文: SIREN通过选择性熵正则化解决了大型推理模型中强化学习可验证奖励的熵崩溃问题,采用两步掩码机制限制探索范围并结合自锚定正则化稳定训练,从而在数学基准测试中取得更优性能。
English: SIREN introduces selective entropy regularization to prevent entropy collapse in reinforcement learning with verifiable rewards for large reasoning models, achieving superior performance by controlling exploration through a two-step masking mechanism and stabilizing training with self-anchored regularization.

Authors:Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang, Qingsong Wen, Zuozhu Liu, Sabato Marco Siniscalchi, Ming Jin, Shirui Pan
Title: TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Abstract:
Recent advances in multimodal time series learning underscore a paradigm shift from analytics centered on basic patterns toward advanced time series understanding and reasoning. However, existing multimodal time series datasets mostly remain at the level of surface alignment and question answering, without reaching the depth of genuine reasoning. The absence of well-defined tasks that genuinely require time series reasoning, along with the scarcity of high-quality data, has limited progress in building practical time series reasoning models (TSRMs). To this end, we introduce Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series: (1) perception, acquired through scenario understanding and causality discovery; (2) extrapolation, realized via event-aware forecasting; and (3) decision-making, developed through deliberation over perception and extrapolation. TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs. It contains more than 23K samples, of which 2.3K are carefully curated through a human-guided hierarchical annotation process. Building on this foundation, we introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning. The model is trained in multiple stages, integrating a mixture of task scenarios, novel reward functions, and tailored optimizations. Experiments show that TimeOmni-1 delivers strong out-of-distribution generalization across all tasks and achieves a high rate of valid responses. It significantly improves causality discovery accuracy (64.0% vs. 35.9% with GPT-4.1) and raises the valid response rate by over 6% compared to GPT-4.1 on the event-aware forecasting task.
中文: 摘要介绍了时间序列推理套件(TSR-Suite),以解决多模态时间序列学习中缺乏真正推理任务的问题,包含四个原子任务和超过23K样本,并推出了TimeOmni-1统一模型,该模型在因果发现和预测方面相比GPT-4.1表现出更强的泛化能力和性能提升。
English: The abstract introduces the Time Series Reasoning Suite (TSR-Suite) to address the lack of genuine reasoning tasks in multimodal time series learning, featuring four atomic tasks and over 23K samples, and presents TimeOmni-1, a unified model that demonstrates strong generalization and improved performance in causality discovery and forecasting compared to GPT-4.1.

Authors:Xue-Feng Zhu, Tianyang Xu, Yifan Pan, Jinjie Gu, Xi Li, Jiwen Lu, Xiao-Jun Wu, Josef Kittler
Title: Collaborating Vision, Depth, and Thermal Signals for Multi-Modal Tracking: Dataset and Algorithm
Abstract:
Existing multi-modal object tracking approaches primarily focus on dual-modal paradigms, such as RGB-Depth or RGB-Thermal, yet remain challenged in complex scenarios due to limited input modalities. To address this gap, this work introduces a novel multi-modal tracking task that leverages three complementary modalities, including visible RGB, Depth (D), and Thermal Infrared (TIR), aiming to enhance robustness in complex scenarios. To support this task, we construct a new multi-modal tracking dataset, coined RGBDT500, which consists of 500 videos with synchronised frames across the three modalities. Each frame provides spatially aligned RGB, depth, and thermal infrared images with precise object bounding box annotations. Furthermore, we propose a novel multi-modal tracker, dubbed RDTTrack. RDTTrack integrates tri-modal information for robust tracking by leveraging a pretrained RGB-only tracking model and prompt learning techniques. In specific, RDTTrack fuses thermal infrared and depth modalities under a proposed orthogonal projection constraint, then integrates them with RGB signals as prompts for the pre-trained foundation tracking model, effectively harmonising tri-modal complementary cues. The experimental results demonstrate the effectiveness and advantages of the proposed method, showing significant improvements over existing dual-modal approaches in terms of tracking accuracy and robustness in complex scenarios.
中文:本文提出了一种新颖的多模态目标跟踪方法,融合RGB、深度和热红外三种互补模态,通过构建RGBDT500数据集和开发RDTTrack跟踪器,利用正交投影约束和提示学习技术,在复杂场景中显著提升了跟踪的准确性和鲁棒性。
English: This paper introduces a novel multi-modal object tracking approach using RGB, Depth, and Thermal Infrared modalities, supported by the RGBDT500 dataset and RDTTrack method, which significantly enhances tracking robustness in complex scenarios through tri-modal fusion and prompt learning.

Authors:Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello
Title: A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity
Abstract:
Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods may not provide indicators that all the involved modalities are effectively aligned. As a result, some modalities may not be aligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, the novel proposed similarity measure that is directly computed in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated in contrastive losses replacing cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling, while yielding interpretable alignment rationales. Extensive evaluation in three-modal tasks such as video-text and audio-text retrieval or audio-video classification, demonstrates that TRIANGLE achieves state-of-the-art results across different datasets improving the performance of cosine-based methods up to 9 points of Recall@1.
中文: 本文提出TRIANGLE这一新型相似度度量方法,通过三角形面积相似性改进三种模态的联合对齐,在视频-文本检索等任务中取得领先性能,并将基于余弦相似度的方法在Recall@1指标上提升高达9个百分点。
English: This paper introduces TRIANGLE, a novel similarity measure that enhances multimodal learning by improving the joint alignment of three modalities through triangle-area similarity, achieving state-of-the-art results in tasks like video-text retrieval and boosting performance by up to 9 points in Recall@1.

Authors:Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie
Title: SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Abstract:
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
中文: SANA-Video 是一种小型扩散模型,通过线性DiT和恒定内存KV缓存技术,能够高效生成高分辨率、长时长且文本对齐度高的视频,在消费级GPU上实现低成本快速部署。
English: SANA-Video is a small diffusion model that efficiently produces high-resolution, long-duration videos with strong text alignment, leveraging Linear DiT and constant-memory KV cache for fast, cost-effective generation deployable on consumer GPUs.

Authors:Eleonora Grassucci, Giuliano Galadini, Giordano Cicchetti, Aurelio Uncini, Fabio Antonacci, Danilo Comminiello
Title: Training-Free Multimodal Guidance for Video to Audio Generation
Abstract:
Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Although the excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, proving the effectiveness of a joint multimodal guidance for V2A.
Chinese: 本研究提出了一种无需训练的多模态引导机制,用于视频到音频生成,无需重新训练模型即可提升视频、音频和文本之间的感知质量与跨模态对齐效果。
English: This study introduces a training-free multimodal guidance mechanism for video-to-audio generation that enhances perceptual quality and alignment across video, audio, and text without requiring model retraining.

Authors:Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello
Title: Semantic Compression via Multimodal Representation Learning
Abstract:
Multimodal representation learning produces high-dimensional embeddings that align diverse modalities in a shared latent space. While this enables strong generalization, it also introduces scalability challenges, both in terms of storage and downstream processing. A key open problem is how to achieve semantic compression, reducing the memory footprint of multimodal embeddings while preserving their ability to represent shared semantic content across modalities. In this paper, we prove a strong connection between reducing the modality gap, which is the residual separation of embeddings from different modalities, and the feasibility of post-training semantic compression. When the gap is sufficiently reduced, embeddings from different modalities but expressing the same semantics share a common portion of the space. Therefore, their centroid is a faithful representation of such a semantic concept. This enables replacing multiple embeddings with a single centroid, yielding significant memory savings. We propose a novel approach for semantic compression grounded on the latter intuition, operating directly on pretrained encoders. We demonstrate its effectiveness across diverse large-scale multimodal downstream tasks. Our results highlight that modality alignment is a key enabler for semantic compression, showing that the proposed approach achieves significant compression without sacrificing performance.
中文摘要:本文证明减小模态间隙可实现语义压缩,通过用单个质心替代多个嵌入,在多模态任务中实现显著内存节省且不损失性能。
English Summary: This paper demonstrates that reducing the modality gap enables semantic compression by replacing multiple embeddings with a single centroid, achieving significant memory savings without performance loss in multimodal tasks.

Authors:Runmin Jiang, Wanyue Feng, Yuntian Yang, Shriya Pingulkar, Hong Wang, Xi Xiao, Xiaoyu Cao, Genpei Zhang, Xiao Wang, Xiaolong Wu, Tianyang Wang, Yang Liu, Xingjian Li, Min Xu
Title: Towards Foundation Models for Cryo-ET Subtomogram Analysis
Abstract:
Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular structures, where subtomogram analysis tasks such as classification, alignment, and averaging are critical for structural determination. However, effective analysis is hindered by scarce annotations, severe noise, and poor generalization. To address these challenges, we take the first step towards foundation models for cryo-ET subtomograms. First, we introduce CryoEngine, a large-scale synthetic data generator that produces over 904k subtomograms from 452 particle classes for pretraining. Second, we design an Adaptive Phase Tokenization-enhanced Vision Transformer (APT-ViT), which incorporates adaptive phase tokenization as an equivariance-enhancing module that improves robustness to both geometric and semantic variations. Third, we introduce a Noise-Resilient Contrastive Learning (NRCL) strategy to stabilize representation learning under severe noise conditions. Evaluations across 24 synthetic and real datasets demonstrate state-of-the-art (SOTA) performance on all three major subtomogram tasks and strong generalization to unseen datasets, advancing scalable and robust subtomogram analysis in cryo-ET.
中文: 本研究提出CryoEngine大规模合成数据生成器,并开发具备自适应相位标记化和抗噪对比学习的APT-ViT模型,成功解决冷冻电镜断层扫描子断层图分析中的标注稀缺、噪声干扰和泛化能力不足等难题,在多项测试中均取得最优性能。
English: This study introduces CryoEngine, a large-scale synthetic data generator, and develops APT-ViT with adaptive phase tokenization and noise-resilient contrastive learning to overcome annotation scarcity, noise, and generalization issues in cryo-ET subtomogram analysis, achieving state-of-the-art performance across diverse datasets.

Authors:Vignesh Ramanathan, Michael Milford, Tobias Fischer
Title: Prepare for Warp Speed: Sub-millisecond Visual Place Recognition Using Event Cameras
Abstract:
Visual Place Recognition (VPR) enables systems to identify previously visited locations within a map, a fundamental task for autonomous navigation. Prior works have developed VPR solutions using event cameras, which asynchronously measure per-pixel brightness changes with microsecond temporal resolution. However, these approaches rely on dense representations of the inherently sparse camera output and require tens to hundreds of milliseconds of event data to predict a place. Here, we break this paradigm with Flash, a lightweight VPR system that predicts places using sub-millisecond slices of event data. Our method is based on the observation that active pixel locations provide strong discriminative features for VPR. Flash encodes these active pixel locations using efficient binary frames and computes similarities via fast bitwise operations, which are then normalized based on the relative event activity in the query and reference frames. Flash improves Recall@1 for sub-millisecond VPR over existing baselines by 11.33x on the indoor QCR-Event-Dataset and 5.92x on the 8 km Brisbane-Event-VPR dataset. Moreover, our approach reduces the duration for which the robot must operate without awareness of its position, as evidenced by a localization latency metric we term Time to Correct Match (TCM). To the best of our knowledge, this is the first work to demonstrate sub-millisecond VPR using event cameras.
中文:Flash是一种轻量级视觉位置识别系统,它利用事件相机中的活跃像素位置生成二进制帧,实现了亚毫秒级的地点识别,相比现有方法将召回率提升5倍以上,同时显著降低了定位延迟。
English: Flash is a lightweight Visual Place Recognition system that achieves sub-millisecond place identification using binary frames of active pixel locations from event cameras, improving recall rates by over 5x compared to existing methods while reducing localization latency.

Authors:Qiushui Xu, Yuhao Huang, Yushu Jiang, Lei Song, Jinyu Wang, Wenliang Zheng, Jiang Bian
Title: In-Context Compositional Q-Learning for Offline Reinforcement Learning
Abstract:
Accurately estimating the Q-function is a central challenge in offline reinforcement learning. However, existing approaches often rely on a single global Q-function, which struggles to capture the compositional nature of tasks involving diverse subtasks. We propose In-context Compositional Q-Learning (\texttt{ICQL}), the first offline RL framework that formulates Q-learning as a contextual inference problem, using linear Transformers to adaptively infer local Q-functions from retrieved transitions without explicit subtask labels. Theoretically, we show that under two assumptions--linear approximability of the local Q-function and accurate weight inference from retrieved context--\texttt{ICQL} achieves bounded Q-function approximation error, and supports near-optimal policy extraction. Empirically, \texttt{ICQL} substantially improves performance in offline settings: improving performance in kitchen tasks by up to 16.4\%, and in Gym and Adroit tasks by up to 8.6\% and 6.3\%. These results highlight the underexplored potential of in-context learning for robust and compositional value estimation, positioning \texttt{ICQL} as a principled and effective framework for offline RL.
中文: 提出的上下文组合Q学习(ICQL)框架通过线性Transformer进行上下文推理自适应推断局部Q函数,在多种任务中实现显著性能提升,同时为有界近似误差提供理论保证,有效解决了离线强化学习中的核心难题。
English: The proposed In-context Compositional Q-Learning (ICQL) framework addresses offline reinforcement learning challenges by adaptively inferring local Q-functions through contextual inference with linear Transformers, achieving significant performance improvements across various tasks while providing theoretical guarantees for bounded approximation error.

Authors:Guanxu Chen, Yafu Li, Yuxian Jiang, Chen Qian, Qihan Ren, Jingyi Yang, Yu Cheng, Dongrui Liu, Jing Shao
Title: Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) for large language models (LLMs) has achieved remarkable progress in enhancing LLMs' reasoning capabilities on tasks with clear correctness criteria, such as mathematical reasoning tasks. Several training metrics, such as entropy or response length, have been observed to correlate with different reasoning behaviors in reinforcement learning. Prior approaches incorporate such priors through reward or advantage shaping, which often relies on hand-crafted penalties and preferences (e.g., higher-is-better or lower-is-better). However, without careful hyperparameter tuning, these directional priors can be overly biased and may lead to failure. To this end, we introduce Conditional advANtage estimatiON (CANON), amplifying the impact of the target metric without presuming its direction. Specifically, CANON regroups the sampled responses into two groups based on the higher or lower value of a target metric, measures which metric trend contributes to better performance through inter-group comparison, and identifies the better response within the same group. In summary, CANON based on entropy consistently outperforms prior methods across three LLMs on both math reasoning and high-complexity logic tasks. When applied to response length, CANON further improves token efficiency, yielding a more favorable Pareto frontier in the performance-cost trade-off.
中文摘要:CANON方法通过无方向性地动态评估目标指标,强化了大型语言模型的推理能力,在数学推理和复杂逻辑任务中持续优于现有方法,同时提升了标记效率。
English Summary: The proposed CANON method enhances reinforcement learning for large language models by dynamically evaluating target metrics without directional bias, consistently outperforming prior approaches in mathematical reasoning and complex logic tasks while improving token efficiency.

Authors:Bo Li, Xin Zheng, Ming Jin, Can Wang, Shirui Pan
Title: Test-time GNN Model Evaluation on Dynamic Graphs
Abstract:
Dynamic graph neural networks (DGNNs) have emerged as a leading paradigm for learning from dynamic graphs, which are commonly used to model real-world systems and applications. However, due to the evolving nature of dynamic graph data distributions over time, well-trained DGNNs often face significant performance uncertainty when inferring on unseen and unlabeled test graphs in practical deployment. In this case, evaluating the performance of deployed DGNNs at test time is crucial to determine whether a well-trained DGNN is suited for inference on an unseen dynamic test graph. In this work, we introduce a new research problem: DGNN model evaluation, which aims to assess the performance of a specific DGNN model trained on observed dynamic graphs by estimating its performance on unseen dynamic graphs during test time. Specifically, we propose a Dynamic Graph neural network Evaluator, dubbed DyGEval, to address this new problem. The proposed DyGEval involves a two-stage framework: (1) test-time dynamic graph simulation, which captures the training-test distributional differences as supervision signals and trains an evaluator; and (2) DyGEval development and training, which accurately estimates the performance of the well-trained DGNN model on the test-time dynamic graphs. Extensive experiments demonstrate that the proposed DyGEval serves as an effective evaluator for assessing various DGNN backbones across different dynamic graphs under distribution shifts.
中文: DyGEval 是一个新颖的两阶段框架,通过模拟分布变化并训练评估器,旨在准确评估动态图神经网络在未见测试图上的性能表现。
English: DyGEval is a novel two-stage framework designed to evaluate the performance of dynamic graph neural networks on unseen test graphs by simulating distribution shifts and training an evaluator to accurately estimate model efficacy.

Authors:Xuyan Ma, Xiaofei Xie, Yawen Wang, Junjie Wang, Boyu Wu, Mingyang Li, Qing Wang
Title: Diagnosing Failure Root Causes in Platform-Orchestrated Agentic Systems: Dataset, Taxonomy, and Benchmark
Abstract:
Agentic systems consisting of multiple LLM-driven agents coordinating through tools and structured interactions, are increasingly deployed for complex reasoning and problem-solving tasks. At the same time, emerging low-code and template-based agent development platforms (e.g., Dify) enable users to rapidly build and orchestrate agentic systems, which we refer to as platform-orchestrated agentic systems. However, these systems are also fragile and it remains unclear how to systematically identify their potential failure root cause. This paper presents a study of root cause identification of these platform-orchestrated agentic systems. To support this initiative, we construct a dataset AgentFail containing 307 failure logs from ten agentic systems, each with fine-grained annotations linking failures to their root causes. We additionally utilize counterfactual reasoning-based repair strategy to ensure the reliability of the annotation. Building on the dataset, we develop a taxonomy that characterizes failure root causes and analyze their distribution across different platforms and task domains. Furthermore, we introduce a benchmark that leverages LLMs for automatically identifying root causes, in which we also utilize the proposed taxonomy as guidance for LLMs. Results show that the taxonomy can largely improve the performance, thereby confirming its utility. Nevertheless, the accuracy of root cause identification reaches at most 33.6%, which indicates that this task still remains challenging. In light of these results, we also provide actionable guidelines for building such agentic systems. In summary, this paper provides a reliable dataset of failure root cause for platform-orchestrated agentic systems, corresponding taxonomy and benchmark, which serves as a foundation for advancing the development of more reliable agentic systems.
Chinese: 本研究通过构建带标注故障日志的AgentFail数据集、开发故障分类体系并建立基准测试,探究平台编排智能体系统的故障根源识别,结果显示当前自动识别方法准确率最高仅33.6%,仍面临重大挑战。
English: This study investigates root cause identification in platform-orchestrated agentic systems by creating the AgentFail dataset with annotated failure logs, developing a taxonomy to classify failures, and establishing a benchmark that shows current automated identification methods remain challenging with only 33.6% accuracy.

Authors:Zhiqiang Liu, Yichi Zhang, Mengshu Sun, Lei Liang, Wen Zhang
Title: Collaboration of Fusion and Independence: Hypercomplex-driven Robust Multi-Modal Knowledge Graph Completion
Abstract:
Multi-modal knowledge graph completion (MMKGC) aims to discover missing facts in multi-modal knowledge graphs (MMKGs) by leveraging both structural relationships and diverse modality information of entities. Existing MMKGC methods follow two multi-modal paradigms: fusion-based and ensemble-based. Fusion-based methods employ fixed fusion strategies, which inevitably leads to the loss of modality-specific information and a lack of flexibility to adapt to varying modality relevance across contexts. In contrast, ensemble-based methods retain modality independence through dedicated sub-models but struggle to capture the nuanced, context-dependent semantic interplay between modalities. To overcome these dual limitations, we propose a novel MMKGC method M-Hyper, which achieves the coexistence and collaboration of fused and independent modality representations. Our method integrates the strengths of both paradigms, enabling effective cross-modal interactions while maintaining modality-specific information. Inspired by ``quaternion'' algebra, we utilize its four orthogonal bases to represent multiple independent modalities and employ the Hamilton product to efficiently model pair-wise interactions among them. Specifically, we introduce a Fine-grained Entity Representation Factorization (FERF) module and a Robust Relation-aware Modality Fusion (R2MF) module to obtain robust representations for three independent modalities and one fused modality. The resulting four modality representations are then mapped to the four orthogonal bases of a biquaternion (a hypercomplex extension of quaternion) for comprehensive modality interaction. Extensive experiments indicate its state-of-the-art performance, robustness, and computational efficiency.
中文摘要:提出的M-Hyper方法通过四元数启发的建模实现了模态表示协同融合与独立并存,克服了多模态知识图谱补全中的现有局限,取得了领先性能。
English Summary: The proposed M-Hyper method overcomes limitations in multi-modal knowledge graph completion by enabling collaborative fusion and independence of modality representations through quaternion-inspired modeling, achieving state-of-the-art performance.

Authors:Qingren Yao, Ming Jin, Chengqi Zhang, Chao-Han Huck Yang, Jun Qi, Shirui Pan
Title: Estimating Time Series Foundation Model Transferability via In-Context Learning
Abstract:
Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training, yet fine-tuning remains critical for boosting performance in domains with limited public data. With the growing number of TSFMs, efficiently identifying the best model for downstream fine-tuning becomes increasingly challenging. In this work, we introduce TimeTic, a transferability estimation framework that recasts model selection as an in-context-learning problem: given observations on known (source) datasets, it predicts how a TSFM will perform after fine-tuning on a downstream (target) dataset. TimeTic flexibly organizes the observed model-data relationships as contextual information, allowing it to adapt seamlessly to various test-time scenarios. Leveraging the natural tabular structure formed by dataset meta-features, model characteristics, and fine-tuned performance, we employ tabular foundation models to serve as in-context learners. We further introduce a novel model characterization based on entropy evolution across model layers, capturing embedding-space distinctions and enabling TimeTic to generalize across arbitrary model sets. We establish a comprehensive benchmark for transferability estimation including 10 datasets, 10 foundation models, and 3 forecasting tasks. On this benchmark, TimeTic's estimation demonstrates strong alignment with actual fine-tuned performance for previously unseen datasets, achieving a mean rank correlation of approximately 0.6 and a 30% improvement compared to using zero-shot performance as the transferability score.
中文: TimeTic作为一个可迁移性评估框架,通过上下文学习预测时序基础模型在下游任务中的微调表现,相比零样本方法显著提升了模型选择效果。
English: TimeTic is a transferability estimation framework that uses in-context learning to predict time series foundation models' fine-tuning performance, achieving significantly better model selection than zero-shot approaches.

Authors:Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng
Title: LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Abstract:
We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Different from the existing works, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset LLaVA-OneVision-1.5-Mid-Traning and a meticulously curated 22M instruction dataset LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework leveraging an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields exceptionally competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
Chinese: LLaVA-OneVision-1.5 推出了突破性的大型多模态模型系列,通过开放、高效且可复现的框架,在显著降低计算和财务成本的同时实现了顶尖性能。
English: LLaVA-OneVision-1.5 introduces a groundbreaking family of Large Multimodal Models that deliver top-tier performance while drastically cutting computational and financial expenses through an open, efficient, and reproducible framework.

Authors:Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng
Title: ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
Abstract:
While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like "re-watching" process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation \& Reasoning (O\&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks. Project Page: https://rewatch-r1.github.io
中文: ReWatch数据集通过提供多跳问题和基于视频的思维链数据,解决了复杂视频推理中的数据瓶颈问题,而开发的ReWatch-R1模型通过惩罚幻觉的新型RLVR框架实现了最先进的性能。
English: The ReWatch dataset addresses the data bottleneck in complex video reasoning by providing multi-hop questions and video-grounded Chain-of-Thought data, while the developed ReWatch-R1 model achieves state-of-the-art performance through a novel RLVR framework that penalizes hallucination.

Authors:Piotr Luszczek, Vijay Gadepally, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel J. Burrill, Chansup Byun, Michael Houle, Matthew Hubbell, Hayden Jananthan, Michael Jones, Peter Michaleas, Guillermo Morales, Julia Mullen, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Jeremy Kepner
Title: Performance and Numerical Aspects of Decompositional Factorizations with FP64 Floating-Point Emulation in INT8
Abstract:
Mixing precisions for performance has been an ongoing trend as the modern hardware accelerators started including new, and mostly lower-precision, data formats. The advantage of using them is a great potential of performance gain and energy savings. The disadvantage are the numerical issues not present in the standard-mandated floating-point formats. Split integer emulation of FP64 takes this to an extreme with the computation performed only by fixed-point tensor core units. We present the new issues the emulation faces for practical cases involving dense linear solver. We show extensive numerical tests indicating the effect of extended numerical range of matrix entries. We also scaled the input sizes to study the performance and numerical profiles on the NVIDIA Hopper GPUs.
中文: 现代硬件加速器越来越多地支持混合精度格式,虽能提升性能和节能,但会带来数值问题,尤其在基于NVIDIA Hopper GPU的FP64整数拆分仿真稠密线性求解器中更为突出。
English: Modern hardware accelerators increasingly support mixed precision formats, offering performance and energy benefits but introducing numerical challenges, particularly in dense linear solvers using split integer emulation of FP64 on NVIDIA Hopper GPUs.

Authors:Jinzhe Pan, Jingqing Wang, Yuehui Ouyang, Wenchi Cheng, Wei Zhang
Title: AI-Enhanced Distributed Channel Access for Collision Avoidance in Future Wi-Fi 8
Abstract:
The exponential growth of wireless devices and stringent reliability requirements of emerging applications demand fundamental improvements in distributed channel access mechanisms for unlicensed bands. Current Wi-Fi systems, which rely on binary exponential backoff (BEB), suffer from suboptimal collision resolution in dense deployments and persistent fairness challenges due to inherent randomness. This paper introduces a multi-agent reinforcement learning framework that integrates artificial intelligence (AI) optimization with legacy device coexistence. We first develop a dynamic backoff selection mechanism that adapts to real-time channel conditions through access deferral events while maintaining full compatibility with conventional CSMA/CA operations. Second, we introduce a fairness quantification metric aligned with enhanced distributed channel access (EDCA) principles to ensure equitable medium access opportunities. Finally, we propose a centralized training decentralized execution (CTDE) architecture incorporating neighborhood activity patterns as observational inputs, optimized via constrained multi-agent proximal policy optimization (MAPPO) to jointly minimize collisions and guarantee fairness. Experimental results demonstrate that our solution significantly reduces collision probability compared to conventional BEB while preserving backward compatibility with commercial Wi-Fi devices. The proposed fairness metric effectively eliminates starvation risks in heterogeneous scenarios.
中文摘要:本文提出一种多智能体强化学习框架,通过动态调整退避机制和引入新型公平性指标来优化Wi-Fi信道接入,在保持与现有设备兼容的同时显著降低了碰撞概率。
English Summary: This paper proposes a multi-agent reinforcement learning framework that optimizes Wi-Fi channel access by dynamically adjusting backoff mechanisms and introducing a novel fairness metric, significantly reducing collisions while maintaining compatibility with existing devices.

Authors:Chenrui Ma, Xi Xiao, Tianyang Wang, Xiao Wang, Yanning Shen
Title: Stochastic Interpolants via Conditional Dependent Coupling
Abstract:
Existing image generation models face critical challenges regarding the trade-off between computation and fidelity. Specifically, models relying on a pretrained Variational Autoencoder (VAE) suffer from information loss, limited detail, and the inability to support end-to-end training. In contrast, models operating directly in the pixel space incur prohibitive computational cost. Although cascade models can mitigate computational cost, stage-wise separation prevents effective end-to-end optimization, hampers knowledge sharing, and often results in inaccurate distribution learning within each stage. To address these challenges, we introduce a unified multistage generative framework based on our proposed Conditional Dependent Coupling strategy. It decomposes the generative process into interpolant trajectories at multiple stages, ensuring accurate distribution learning while enabling end-to-end optimization. Importantly, the entire process is modeled as a single unified Diffusion Transformer, eliminating the need for disjoint modules and also enabling knowledge sharing. Extensive experiments demonstrate that our method achieves both high fidelity and efficiency across multiple resolutions.
中文: 提出的基于条件依赖耦合的统一多阶段生成框架通过单一扩散变换器实现端到端优化,解决了图像生成中计算与保真度的权衡问题,在多种分辨率下均实现了高效与高质。
English: The proposed unified multistage generative framework using Conditional Dependent Coupling overcomes computational and fidelity trade-offs in image generation by enabling end-to-end optimization through a single Diffusion Transformer, achieving high efficiency and quality across resolutions.

Authors:Yuan Xu, Jiabing Yang, Xiaofeng Wang, Yixiang Chen, Zheng Zhu, Bowen Fang, Guan Huang, Xinze Chen, Yun Ye, Qiang Zhang, Peiyan Li, Xiangnan Wu, Kai Wang, Bing Zhan, Shuo Lu, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang
Title: EgoDemoGen: Novel Egocentric Demonstration Generation Enables Viewpoint-Robust Manipulation
Abstract:
Imitation learning based policies perform well in robotic manipulation, but they often degrade under *egocentric viewpoint shifts* when trained from a single egocentric viewpoint. To address this issue, we present **EgoDemoGen**, a framework that generates *paired* novel egocentric demonstrations by retargeting actions in the novel egocentric frame and synthesizing the corresponding egocentric observation videos with proposed generative video repair model **EgoViewTransfer**, which is conditioned by a novel-viewpoint reprojected scene video and a robot-only video rendered from the retargeted joint actions. EgoViewTransfer is finetuned from a pretrained video generation model using self-supervised double reprojection strategy. We evaluate EgoDemoGen on both simulation (RoboTwin2.0) and real-world robot. After training with a mixture of EgoDemoGen-generated novel egocentric demonstrations and original standard egocentric demonstrations, policy success rate improves **absolutely** by **+17.0%** for standard egocentric viewpoint and by **+17.7%** for novel egocentric viewpoints in simulation. On real-world robot, the **absolute** improvements are **+18.3%** and **+25.8%**. Moreover, performance continues to improve as the proportion of EgoDemoGen-generated demonstrations increases, with diminishing returns. These results demonstrate that EgoDemoGen provides a practical route to egocentric viewpoint-robust robotic manipulation.
中文: EgoDemoGen通过动作重定向和视频合成生成配对的新型自我中心演示,解决了模仿学习在视角变化下的性能退化问题,在真实机器人上实现了高达25.8%的绝对性能提升。
English: EgoDemoGen addresses imitation learning's vulnerability to egocentric viewpoint shifts by generating paired novel demonstrations through action retargeting and video synthesis, achieving absolute performance improvements of up to 25.8% on real robots.

Authors:Leonhard Grosse, Sara Saeidian, Mikael Skoglund, Tobias J. Oechtering
Title: Privacy Mechanism Design based on Empirical Distributions
Abstract:
Pointwise maximal leakage (PML) is a per-outcome privacy measure based on threat models from quantitative information flow. Privacy guarantees with PML rely on knowledge about the distribution that generated the private data. In this work, we propose a framework for PML privacy assessment and mechanism design with empirical estimates of this data-generating distribution. By extending the PML framework to consider sets of data-generating distributions, we arrive at bounds on the worst-case leakage within a given set. We use these bounds alongside large-deviation bounds from the literature to provide a method for obtaining distribution-independent $(\varepsilon,δ)$-PML guarantees when the data-generating distribution is estimated from available data samples. We provide an optimal binary mechanism, and show that mechanism design with this type of uncertainty about the data-generating distribution reduces to a linearly constrained convex program. Further, we show that optimal mechanisms designed for a distribution estimate can be used. Finally, we apply these tools to leakage assessment of the Laplace mechanism and the Gaussian mechanism for binary private data, and numerically show that the presented approach to mechanism design can yield significant utility increase compared to local differential privacy, while retaining similar privacy guarantees.
中文: 本研究提出了一个基于经验数据分布估计的点式最大泄露隐私评估框架,实现了分布无关的隐私保证和最优机制设计,相比本地差分隐私显著提升了效用。
English: This work introduces a framework for assessing pointwise maximal leakage (PML) privacy using empirical data distribution estimates, enabling distribution-independent guarantees and optimal mechanism design that significantly improves utility over local differential privacy.

Authors:Zhehao Dong, Xiaofeng Wang, Zheng Zhu, Yirui Wang, Yang Wang, Yukun Zhou, Boyuan Wang, Chaojun Ni, Runqi Ouyang, Wenkang Qin, Xinze Chen, Yun Ye, Guan Huang
Title: EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer
Abstract:
Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization. However, collecting large-scale real-world robot manipulation data across varied object appearances and environmental conditions remains prohibitively time-consuming and expensive. To overcome this bottleneck, we propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that integrates a generative data engine with an effective training pipeline. We introduce DreamTransfer, a diffusion Transformer-based framework for generating multi-view consistent, geometrically grounded embodied manipulation videos. DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometrical plausibility. Furthermore, we explore hybrid training with real and generated data, and introduce AdaMix, a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples. Extensive experiments show that videos generated by DreamTransfer significantly outperform prior video generation methods in multi-view consistency, geometric fidelity, and text-conditioning accuracy. Crucially, VLAs trained with generated data enable robots to generalize to unseen object categories and novel visual domains using only demonstrations from a single appearance. In real-world robotic manipulation tasks with zero-shot visual domains, our approach achieves over a 200% relative performance gain compared to training on real data alone, and further improves by 13% with AdaMix, demonstrating its effectiveness in boosting policy generalization.
中文摘要:EMMA框架通过DreamTransfer生成逼真的机器人操作视频并采用AdaMix训练策略,显著增强了视觉-语言-动作模型的泛化能力,在未知视觉领域中实现了超过200%的性能提升。
English Summary: The EMMA framework enhances vision-language-action models by generating realistic robot manipulation videos through DreamTransfer and employing AdaMix training strategy, significantly improving policy generalization with over 200% performance gain in unseen visual domains.

Authors:Haoyun Li, Ivan Zhang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Zhiqin Yang, Zhentao Zhang, Boyuan Wang, Chaojun Ni, Wenkang Qin, Xinze Chen, Yun Ye, Guan Huang, Zhenbo Song, Xingang Wang
Title: MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training
Abstract:
Vision Language Action (VLA) models derive their generalization capability from diverse training data, yet collecting embodied robot interaction data remains prohibitively expensive. In contrast, human demonstration videos are far more scalable and cost-efficient to collect, and recent studies confirm their effectiveness in training VLA models. However, a significant domain gap persists between human videos and robot-executed videos, including unstable camera viewpoints, visual discrepancies between human hands and robotic arms, and differences in motion dynamics. To bridge this gap, we propose MimicDreamer, a framework that turns fast, low-cost human demonstrations into robot-usable supervision by jointly aligning vision, viewpoint, and actions to directly support policy training. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos by transferring motion from human manipulation footage. For viewpoint stabilization, EgoStabilizer is proposed, which canonicalizes egocentric videos via homography and inpaints occlusions and distortions caused by warping. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver to produce feasible, low-jitter joint commands with accurate pose tracking. Empirically, VLA models trained purely on our synthesized human-to-robot videos achieve few-shot execution on real robots. Moreover, scaling training with human data significantly boosts performance compared to models trained solely on real robot data; our approach improves the average success rate by 14.7\% across six representative manipulation tasks.
中文: MimicDreamer通过视觉、视角和动作的对齐,弥合了人类与机器人视频之间的领域差距,使基于合成人机转换数据训练的VLA模型在真实机器人任务中实现卓越性能,平均成功率提升14.7%。
English: MimicDreamer bridges the domain gap between human and robot videos by aligning vision, viewpoint, and actions, enabling VLA models trained on synthesized human-to-robot data to achieve superior real-world robot performance with a 14.7% average success rate improvement.

Authors:Justin Vasselli, Eunike Andriani Kardinata, Yusuke Sakai, Taro Watanabe
Title: Multilingual Dialogue Generation and Localization with Dialogue Act Scripting
Abstract:
Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.
中文摘要:本文提出对话行为脚本(DAS)框架,通过抽象意图生成多语言对话以规避翻译失真问题,人工评估显示在意大利语、德语和中文中,DAS在文化适配性与对话自然度上均优于翻译生成的结果。
English Summary: This paper introduces Dialogue Act Script (DAS), a framework that generates multilingual dialogues from abstract intents to overcome translation artifacts, with human evaluations showing DAS outperforms translation methods in cultural relevance and naturalness across three languages.

Authors:Brian B. Moser, Tobias C. Nauen, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Joachim Folz, Andreas Dengel
Title: SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection
Abstract:
The goal of coreset selection is to identify representative subsets of datasets for efficient model training. Yet, existing approaches paradoxically require expensive training-based signals, e.g., gradients, decision boundary estimates or forgetting counts, computed over the entire dataset prior to pruning, which undermines their very purpose by requiring training on samples they aim to avoid. We introduce SubZeroCore, a novel, training-free coreset selection method that integrates submodular coverage and density into a single, unified objective. To achieve this, we introduce a sampling strategy based on a closed-form solution to optimally balance these objectives, guided by a single hyperparameter that explicitly controls the desired coverage for local density measures. Despite no training, extensive evaluations show that SubZeroCore matches training-based baselines and significantly outperforms them at high pruning rates, while dramatically reducing computational overhead. SubZeroCore also demonstrates superior robustness to label noise, highlighting its practical effectiveness and scalability for real-world scenarios.
Chinese: SubZeroCore是一种无需训练的核集选择方法,它将子模覆盖和密度结合成统一目标,在性能上媲美甚至超越基于训练的方法,同时显著降低计算成本并增强对标签噪声的鲁棒性。
English: SubZeroCore is a training-free coreset selection method that combines submodular coverage and density into a unified objective, matching or surpassing training-based methods in performance while significantly reducing computational costs and enhancing robustness to label noise.

Authors:Brian B. Moser, Arundhati S. Shanbhag, Tobias C. Nauen, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
Title: HyperCore: Coreset Selection under Noise via Hypersphere Models
Abstract:
The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden's J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
中文: HyperCore是一种鲁棒的自适应核心集选择框架,利用超球面模型和尤登指数自动剔除噪声数据,在嘈杂和低数据场景下优于现有方法。
English: HyperCore is a robust, adaptive coreset selection framework that uses hypersphere models and Youden's J statistic to automatically prune noisy data, outperforming existing methods in noisy and low-data scenarios.

Authors:Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, Guihai Chen
Title: Nova: Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
Abstract:
This paper presents Nova, a real-time scheduling framework for serving agentic vision-language models (VLMs) on a single GPU with balanced per-request latency and overall request process throughput. Our design begins by enabling effective pipelining across vision encode, LLM prefill, and LLM decode stages of VLMs, by exploiting their heterogeneous resource demands during execution and incorporating elastic GPU spatial partitioning among stages to maximally utilize the compute and memory resources. Building on this, we introduce a real-time scheduling algorithm that adaptively calibrates resource allocation among stages based on a Pareto-optimal analysis of the latency-throughput trade-off, allowing the system to sustain responsiveness and resource efficiency under dynamic request loads. To further alleviate GPU memory pressure, we design a lightweight weight offloading strategy for vision encoders that preserves inference efficiency with minimized memory overhead. Extensive evaluations on both synthetic and real-world agent workloads demonstrate that Nova consistently outperforms the state-of-the-art baselines, improving the maximum latency by up to 23.3%, while keeping competitive throughput.
中文: Nova是一种实时调度框架,通过流水线处理、自适应资源分配和内存优化,在单个GPU上为视觉语言模型实现低延迟与高吞吐量的平衡服务。
English: Nova is a real-time scheduling framework that optimizes GPU resource allocation for agentic vision-language models, achieving balanced latency and throughput through pipelining, adaptive scheduling, and memory management.

Authors:Wei-Teng Chu, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi
Title: Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation
Abstract:
Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as \emph{NeuS} and its variants generally require a large set of multi-view images as input, and require long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface requiring only a single RGB image, by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface following tasks and show its scalability to a variety of benchmark datasets.
本文提出了FINS框架,能够从单张图像快速重建高保真表面和SDF场,在机器人应用中相比现有方法实现了更优的速度与精度表现。
This paper introduces FINS, a fast and efficient framework that reconstructs high-fidelity surfaces and SDF fields from a single image, achieving superior speed and accuracy over existing methods in robotics applications.

Authors:Francesco Emanuele Stradi, Eleonora Fidelia Chiefari, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Title: Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints
Abstract:
We study \emph{online episodic Constrained Markov Decision Processes} (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, \emph{i.e.}, when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of \emph{positive} constraint violation, which does not allow to recover from large violation in the early episodes by playing strictly safe policies. In the adversarial regime, \emph{i.e.}, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $α$-regret with respect to the \emph{unconstrained} optimum, where $α$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
Chinese: 本文提出了一种新颖的在线情景约束马尔可夫决策过程算法,无需依赖斯莱特条件即可在随机和对抗两种环境下实现次线性遗憾和约束违反,显著改进了现有最佳算法的性能表现。
English: This paper introduces a novel algorithm for online episodic Constrained Markov Decision Processes that significantly improves upon existing methods by achieving sublinear regret and constraint violation in both stochastic and adversarial settings without requiring Slater's condition.

Authors:Francesco Emanuele Stradi, Eleonora Fidelia Chiefari, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
Title: Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints
Abstract:
We study \emph{online episodic Constrained Markov Decision Processes} (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. (2025). In the stochastic regime, \emph{i.e.}, when the constraints are sampled from fixed but unknown distributions, our method achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation without relying on Slater's condition, thereby handling settings where no strictly feasible solution exists. Moreover, we provide guarantees on the stronger notion of \emph{positive} constraint violation, which does not allow to recover from large violation in the early episodes by playing strictly safe policies. In the adversarial regime, \emph{i.e.}, when the constraints may change arbitrarily between episodes, our algorithm ensures sublinear constraint violation without Slater's condition, and achieves sublinear $α$-regret with respect to the \emph{unconstrained} optimum, where $α$ is a suitably defined multiplicative approximation factor. We further validate our results through synthetic experiments, showing the practical effectiveness of our algorithm.
Chinese: 本文提出了一种新颖的在线情景约束马尔可夫决策过程算法,无需依赖斯莱特条件即可在随机和对抗两种环境下实现次线性遗憾和约束违反,显著改进了现有最佳算法的性能表现。
English: This paper introduces a novel algorithm for online episodic Constrained Markov Decision Processes that significantly improves upon existing methods by achieving sublinear regret and constraint violation in both stochastic and adversarial settings without requiring Slater's condition.

Authors:Hayden Jananthan, Jeremy Kepner, Michael Jones, Vijay Gadepally, Michael Houle, Peter Michaleas, Chasen Milner, Alex Pentland
Title: GraphBLAS Mathematical Opportunities: Parallel Hypersparse, Matrix Based Graph Streaming, and Complex-Index Matrices
Abstract:
The GraphBLAS high performance library standard has yielded capabilities beyond enabling graph algorithms to be readily expressed in the language of linear algebra. These GraphBLAS capabilities enable new performant ways of thinking about algorithms that include leveraging hypersparse matrices for parallel computation, matrix-based graph streaming, and complex-index matrices. Formalizing these concepts mathematically provides additional opportunities to apply GraphBLAS to new areas. This paper formally develops parallel hypersparse matrices, matrix-based graph streaming, and complex-index matrices and illustrates these concepts with various examples to demonstrate their potential merits.
中文摘要:GraphBLAS标准通过超稀疏矩阵、基于矩阵的图流和复杂索引矩阵实现了创新的算法方法,其数学形式化为拓展应用领域提供了新机遇。
English Summary: The GraphBLAS standard enables innovative algorithmic approaches through hypersparse matrices, graph streaming, and complex-index matrices, with formal mathematical development expanding its application potential.

Authors:Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, Guihai Chen
Title: ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
Abstract:
Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs) beyond pre-training and instruction tuning. A prominent line of work is RL with verifiable rewards (RLVR), which leverages automatically verifiable outcomes (e.g., correctness or executability) to generate reward signals. While efficient, this framework faces two key limitations: First, its binary feedback is too sparse to capture the quality of the reasoning process. Second, its coarse-grained rewards potentially lead to vanishing gradients. Inspired by observations from human learning, we introduce a RL technique that integrates verifiable outcomes with the model's own confidence estimates. This joint design enriches the reward signal, providing finer-grained feedback and implicitly supervising the reasoning process. Experimental results demonstrate that our proposed method enhances RL performance across multiple datasets and reduces token consumption during inference, while incurring negligible additional training cost. Moreover, it can be used as a plug-in module to enhance other state-of-the-art RL methods.
中文: 本文提出一种强化学习技术,通过结合可验证结果与模型置信度来丰富奖励信号,在多个数据集上提升性能并降低推理成本,且训练开销极小。
English: This paper introduces a reinforcement learning technique that combines verifiable outcomes with model confidence to enrich reward signals, improving performance across datasets and reducing inference costs with minimal training overhead.

Authors:Gokul B. Nair, Alejandro Fontan, Michael Milford, Tobias Fischer
Title: Event-Based Visual Teach-and-Repeat via Fast Fourier-Domain Cross-Correlation
Abstract:
Visual teach-and-repeat navigation enables robots to autonomously traverse previously demonstrated paths by comparing current sensory input with recorded trajectories. However, conventional frame-based cameras fundamentally limit system responsiveness: their fixed frame rates (typically 30-60 Hz) create inherent latency between environmental changes and control responses. Here we present the first event-camera-based visual teach-and-repeat system. To achieve this, we develop a frequency-domain cross-correlation framework that transforms the event stream matching problem into computationally efficient Fourier space multiplications, capable of exceeding 300Hz processing rates, an order of magnitude faster than frame-based approaches. By exploiting the binary nature of event frames and applying image compression techniques, we further enhance the computational speed of the cross-correlation process without sacrificing localization accuracy. Extensive experiments using a Prophesee EVK4 HD event camera mounted on an AgileX Scout Mini robot demonstrate successful autonomous navigation across 4000+ meters of indoor and outdoor trajectories. Our system achieves ATEs below 24 cm while maintaining consistent high-frequency control updates. Our evaluations show that our approach achieves substantially higher update rates compared to conventional frame-based systems, underscoring the practical viability of event-based perception for real-time robotic navigation.
中文: 本研究首次提出基于事件相机的视觉示教导航系统,通过频域互相关框架处理事件流,实现超过300Hz的更新频率(比传统帧方法快十倍),并在4000多米的自主导航中保持低于24厘米的定位精度。
English: This study introduces the first event-camera-based visual teach-and-repeat navigation system, which processes event streams via a frequency-domain cross-correlation framework to achieve over 300Hz update rates—ten times faster than conventional frame-based methods—while maintaining sub-24cm accuracy across 4000+ meters of autonomous navigation.

Authors:Xiaoqiang Kang, Shengen Wu, Zimu Wang, Yilin Liu, Xiaobo Jin, Kaizhu Huang, Wei Wang, Yutao Yue, Xiaowei Huang, Qiufeng Wang
Title: Can GRPO Boost Complex Multimodal Table Understanding?
Abstract:
Existing table understanding methods face challenges due to complex table structures and intricate logical reasoning. While supervised finetuning (SFT) dominates existing research, reinforcement learning (RL), such as Group Relative Policy Optimization (GRPO), has shown promise but struggled with low initial policy accuracy and coarse rewards in tabular contexts. In this paper, we introduce Table-R1, a three-stage RL framework that enhances multimodal table understanding through: (1) Warm-up that prompts initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO), which employs continuous Tree-Edit-Distance Similarity (TEDS) rewards for recognizing table structures and contents, and (3) Hint-Completion GRPO (HC-GRPO), which utilizes fine-grained rewards of residual steps based on the hint-guided question. Extensive experiments demonstrate that Table-R1 can boost the model's table reasoning performance obviously on both held-in and held-out datasets, outperforming SFT and GRPO largely. Notably, Qwen2-VL-7B with Table-R1 surpasses larger specific table understanding models (e.g., Table-LLaVA 13B), even achieving comparable performance to the closed-source model GPT-4o on held-in datasets, demonstrating the efficacy of each stage of Table-R1 in overcoming initialization bottlenecks and reward sparsity, thereby advancing robust multimodal table understanding.
中文: Table-R1框架通过三阶段强化学习方法克服了初始化和奖励稀疏性问题,显著提升了多模态表格理解能力,不仅超越现有方法,在部分数据集上甚至达到了与GPT-4o相当的性能。
English: The Table-R1 framework enhances multimodal table understanding through a three-stage reinforcement learning approach that overcomes initialization and reward challenges, significantly outperforming existing methods and even matching GPT-4o's performance on certain datasets.

Authors:Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
Title: MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Abstract:
Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.
中文摘要:Manzano采用混合图像分词器和统一训练方案,有效平衡视觉理解与生成能力,在减少任务冲突的同时实现了领先性能。
English Summary: Manzano is a unified multimodal framework that integrates a hybrid image tokenizer and a curated training approach to effectively balance visual understanding and generation, achieving state-of-the-art performance with minimal task conflicts.

Authors:Tianyang Wang, Xi Xiao, Gaofei Chen, Xiaoying Liao, Guo Cheng, Yingrui Ji
Title: Boosting Active Learning with Knowledge Transfer
Abstract:
Uncertainty estimation is at the core of Active Learning (AL). Most existing methods resort to complex auxiliary models and advanced training fashions to estimate uncertainty for unlabeled data. These models need special design and hence are difficult to train especially for domain tasks, such as Cryo-Electron Tomography (cryo-ET) classification in computational biology. To address this challenge, we propose a novel method using knowledge transfer to boost uncertainty estimation in AL. Specifically, we exploit the teacher-student mode where the teacher is the task model in AL and the student is an auxiliary model that learns from the teacher. We train the two models simultaneously in each AL cycle and adopt a certain distance between the model outputs to measure uncertainty for unlabeled data. The student model is task-agnostic and does not rely on special training fashions (e.g. adversarial), making our method suitable for various tasks. More importantly, we demonstrate that data uncertainty is not tied to concrete value of task loss but closely related to the upper-bound of task loss. We conduct extensive experiments to validate the proposed method on classical computer vision tasks and cryo-ET challenges. The results demonstrate its efficacy and efficiency.
中文: 本研究提出了一种新颖的主动学习方法,通过师生模型间的知识迁移来提升不确定性估计,无需复杂辅助模型,并在冷冻电子断层扫描分类等任务中验证了其有效性。
English: This study introduces a novel active learning method that employs knowledge transfer between teacher and student models to enhance uncertainty estimation, eliminating the need for complex auxiliary models and proving effective across diverse tasks including cryo-ET classification.

Authors:Tianyang Wang, Xi Xiao, Gaofei Chen, Hanzhang Chi, Qi Zhang, Guo Cheng, Yingrui Ji
Title: TASAM: Terrain-and-Aware Segment Anything Model for Temporal-Scale Remote Sensing Segmentation
Abstract:
Segment Anything Model (SAM) has demonstrated impressive zero-shot segmentation capabilities across natural image domains, but it struggles to generalize to the unique challenges of remote sensing data, such as complex terrain, multi-scale objects, and temporal dynamics. In this paper, we introduce TASAM, a terrain and temporally-aware extension of SAM designed specifically for high-resolution remote sensing image segmentation. TASAM integrates three lightweight yet effective modules: a terrain-aware adapter that injects elevation priors, a temporal prompt generator that captures land-cover changes over time, and a multi-scale fusion strategy that enhances fine-grained object delineation. Without retraining the SAM backbone, our approach achieves substantial performance gains across three remote sensing benchmarks-LoveDA, iSAID, and WHU-CD-outperforming both zero-shot SAM and task-specific models with minimal computational overhead. Our results highlight the value of domain-adaptive augmentation for foundation models and offer a scalable path toward more robust geospatial segmentation.
Chinese: 针对SAM在遥感数据上的不足,TASAM通过集成地形感知、时序提示和多尺度融合模块,在不重新训练SAM主干的情况下显著提升了分割性能。
English: The Segment Anything Model (SAM) struggles with remote sensing data, so TASAM is introduced, incorporating terrain awareness, temporal prompts, and multi-scale fusion to achieve significant performance improvements without retraining SAM's backbone.

Authors:Zhenghao Zhao, Haoxuan Wang, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Title: Efficient Multimodal Dataset Distillation via Generative Models
Abstract:
Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement, and takes days to process the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) The lack of correlation between generated images and captions. 2) The lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18x faster than the state-of-the-art method.
中文: 本文提出EDGE这一高效的多模态数据集蒸馏生成方法,通过双向对比损失和多样性损失解决图像-文本关联性与样本多样性问题,结合标题合成策略,在三大数据集上实现性能超越且提速18倍。
English: This paper introduces EDGE, an efficient generative method for multimodal dataset distillation that overcomes the computational limitations of existing approaches by addressing image-caption correlation and sample diversity through novel loss functions and caption synthesis, achieving superior performance 18 times faster.

Authors:Zhenghao Zhao, Haoxuan Wang, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Title: Efficient Multimodal Dataset Distillation via Generative Models
Abstract:
Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement, and takes days to process the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) The lack of correlation between generated images and captions. 2) The lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18x faster than the state-of-the-art method.
中文: 本文提出EDGE这一高效的多模态数据集蒸馏生成方法,通过双向对比损失和多样性损失解决图像-文本关联性与样本多样性问题,结合标题合成策略,在三大数据集上实现性能超越且提速18倍。
English: This paper introduces EDGE, an efficient generative method for multimodal dataset distillation that overcomes the computational limitations of existing approaches by addressing image-caption correlation and sample diversity through novel loss functions and caption synthesis, achieving superior performance 18 times faster.

Authors:Adam D. Hines, Alejandro Fontan, Michael Milford, Tobias Fischer
Title: Event-LAB: Towards Standardized Evaluation of Neuromorphic Localization Methods
Abstract:
Event-based localization research and datasets are a rapidly growing area of interest, with a tenfold increase in the cumulative total number of published papers on this topic over the past 10 years. Whilst the rapid expansion in the field is exciting, it brings with it an associated challenge: a growth in the variety of required code and package dependencies as well as data formats, making comparisons difficult and cumbersome for researchers to implement reliably. To address this challenge, we present Event-LAB: a new and unified framework for running several event-based localization methodologies across multiple datasets. Event-LAB is implemented using the Pixi package and dependency manager, that enables a single command-line installation and invocation for combinations of localization methods and datasets. To demonstrate the capabilities of the framework, we implement two common event-based localization pipelines: Visual Place Recognition (VPR) and Simultaneous Localization and Mapping (SLAM). We demonstrate the ability of the framework to systematically visualize and analyze the results of multiple methods and datasets, revealing key insights such as the association of parameters that control event collection counts and window sizes for frame generation to large variations in performance. The results and analysis demonstrate the importance of fairly comparing methodologies with consistent event image generation parameters. Our Event-LAB framework provides this ability for the research community, by contributing a streamlined workflow for easily setting up multiple conditions.
中文: Event-LAB框架作为一个统一平台,解决了事件定位研究中代码依赖和数据格式多样化带来的难题,简化了多种方法与数据集的比较流程。
English: The Event-LAB framework is introduced as a unified solution to simplify the implementation and comparison of diverse event-based localization methods and datasets, addressing challenges from growing code dependencies and data format variations.

Authors:Artem Lykov, Oleg Kobzarev, Dzmitry Tsetserukou
Title: GestOS: Advanced Hand Gesture Interpretation via Large Language Models to control Any Type of Robot
Abstract:
We present GestOS, a gesture-based operating system for high-level control of heterogeneous robot teams. Unlike prior systems that map gestures to fixed commands or single-agent actions, GestOS interprets hand gestures semantically and dynamically distributes tasks across multiple robots based on their capabilities, current state, and supported instruction sets. The system combines lightweight visual perception with large language model (LLM) reasoning: hand poses are converted into structured textual descriptions, which the LLM uses to infer intent and generate robot-specific commands. A robot selection module ensures that each gesture-triggered task is matched to the most suitable agent in real time. This architecture enables context-aware, adaptive control without requiring explicit user specification of targets or commands. By advancing gesture interaction from recognition to intelligent orchestration, GestOS supports scalable, flexible, and user-friendly collaboration with robotic systems in dynamic environments.
中文: GestOS是一种基于手势的操作系统,通过视觉感知与大语言模型推理,将手势语义化解析并实时匹配最适合的机器人执行任务,实现动态环境中自适应、可扩展的人机协作。
English: GestOS is a gesture-based operating system that semantically interprets hand gestures and dynamically allocates tasks to multiple robots using visual perception and LLM reasoning for adaptive, context-aware control in dynamic environments.

Authors:Valerii Serpiva, Artem Lykov, Faryal Batool, Vladislav Kozlovskiy, Miguel Altamirano Cabrera, Dzmitry Tsetserukou
Title: FlightDiffusion: Revolutionising Autonomous Drone Training with Diffusion Models Generating FPV Video
Abstract:
We present FlightDiffusion, a diffusion-model-based framework for training autonomous drones from first-person view (FPV) video. Our model generates realistic video sequences from a single frame, enriched with corresponding action spaces to enable reasoning-driven navigation in dynamic environments. Beyond direct policy learning, FlightDiffusion leverages its generative capabilities to synthesize diverse FPV trajectories and state-action pairs, facilitating the creation of large-scale training datasets without the high cost of real-world data collection. Our evaluation demonstrates that the generated trajectories are physically plausible and executable, with a mean position error of 0.25 m (RMSE 0.28 m) and a mean orientation error of 0.19 rad (RMSE 0.24 rad). This approach enables improved policy learning and dataset scalability, leading to superior performance in downstream navigation tasks. Results in simulated environments highlight enhanced robustness, smoother trajectory planning, and adaptability to unseen conditions. An ANOVA revealed no statistically significant difference between performance in simulation and reality (F(1, 16) = 0.394, p = 0.541), with success rates of M = 0.628 (SD = 0.162) and M = 0.617 (SD = 0.177), respectively, indicating strong sim-to-real transfer. The generated datasets provide a valuable resource for future UAV research. This work introduces diffusion-based reasoning as a promising paradigm for unifying navigation, action generation, and data synthesis in aerial robotics.
中文: FlightDiffusion是一个基于扩散模型的框架,能从单帧图像生成逼真的第一视角视频序列及对应动作空间,实现自主无人机导航和大规模数据集合成,并展现出优异的仿真到现实迁移能力。
English: FlightDiffusion is a diffusion-based framework that generates realistic FPV video sequences and corresponding action spaces from a single frame, enabling autonomous drone navigation and scalable dataset synthesis with strong sim-to-real transfer.

Authors:Artem Lykov, Jeffrin Sam, Hung Khang Nguyen, Vladislav Kozlovskiy, Yara Mahmoud, Valerii Serpiva, Miguel Altamirano Cabrera, Mikhail Konenkov, Dzmitry Tsetserukou
Title: PhysicalAgent: Towards General Cognitive Robotics with Foundation World Models
Abstract:
We introduce PhysicalAgent, an agentic framework for robotic manipulation that integrates iterative reasoning, diffusion-based video generation, and closed-loop execution. Given a textual instruction, our method generates short video demonstrations of candidate trajectories, executes them on the robot, and iteratively re-plans in response to failures. This approach enables robust recovery from execution errors. We evaluate PhysicalAgent across multiple perceptual modalities (egocentric, third-person, and simulated) and robotic embodiments (bimanual UR3, Unitree G1 humanoid, simulated GR1), comparing against state-of-the-art task-specific baselines. Experiments demonstrate that our method consistently outperforms prior approaches, achieving up to 83% success on human-familiar tasks. Physical trials reveal that first-attempt success is limited (20-30%), yet iterative correction increases overall success to 80% across platforms. These results highlight the potential of video-based generative reasoning for general-purpose robotic manipulation and underscore the importance of iterative execution for recovering from initial failures. Our framework paves the way for scalable, adaptable, and robust robot control.
中文:PhysicalAgent是一种机器人操作框架,通过迭代推理和基于扩散的视频生成来规划执行任务,实现强韧的错误恢复能力,经迭代修正后成功率高达80%,显著优于现有方法。
English: PhysicalAgent is a robotic manipulation framework that uses iterative reasoning and diffusion-based video generation to plan and execute tasks, enabling robust error recovery and outperforming existing methods with up to 80% success after iterative corrections.

Authors:Nguyen Hoang Khoi Tran, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Title: InterKey: Cross-modal Intersection Keypoints for Global Localization on OpenStreetMap
Abstract:
Reliable global localization is critical for autonomous vehicles, especially in environments where GNSS is degraded or unavailable, such as urban canyons and tunnels. Although high-definition (HD) maps provide accurate priors, the cost of data collection, map construction, and maintenance limits scalability. OpenStreetMap (OSM) offers a free and globally available alternative, but its coarse abstraction poses challenges for matching with sensor data. We propose InterKey, a cross-modal framework that leverages road intersections as distinctive landmarks for global localization. Our method constructs compact binary descriptors by jointly encoding road and building imprints from point clouds and OSM. To bridge modality gaps, we introduce discrepancy mitigation, orientation determination, and area-equalized sampling strategies, enabling robust cross-modal matching. Experiments on the KITTI dataset demonstrate that InterKey achieves state-of-the-art accuracy, outperforming recent baselines by a large margin. The framework generalizes to sensors that can produce dense structural point clouds, offering a scalable and cost-effective solution for robust vehicle localization.
中文摘要:InterKey框架利用道路交叉口作为地标,通过点云和开放街道地图生成紧凑的二进制描述符,采用跨模态匹配策略实现了卓越的车辆全局定位精度。
English Summary: The InterKey framework utilizes road intersections as landmarks for global vehicle localization by creating compact binary descriptors from point clouds and OpenStreetMap, achieving superior accuracy through cross-modal matching strategies.

Authors:Nguyen Hoang Khoi Tran, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Title: InterKey: Cross-modal Intersection Keypoints for Global Localization on OpenStreetMap
Abstract:
Reliable global localization is critical for autonomous vehicles, especially in environments where GNSS is degraded or unavailable, such as urban canyons and tunnels. Although high-definition (HD) maps provide accurate priors, the cost of data collection, map construction, and maintenance limits scalability. OpenStreetMap (OSM) offers a free and globally available alternative, but its coarse abstraction poses challenges for matching with sensor data. We propose InterKey, a cross-modal framework that leverages road intersections as distinctive landmarks for global localization. Our method constructs compact binary descriptors by jointly encoding road and building imprints from point clouds and OSM. To bridge modality gaps, we introduce discrepancy mitigation, orientation determination, and area-equalized sampling strategies, enabling robust cross-modal matching. Experiments on the KITTI dataset demonstrate that InterKey achieves state-of-the-art accuracy, outperforming recent baselines by a large margin. The framework generalizes to sensors that can produce dense structural point clouds, offering a scalable and cost-effective solution for robust vehicle localization.
中文摘要:InterKey框架利用道路交叉口作为地标,通过点云和开放街道地图生成紧凑的二进制描述符,采用跨模态匹配策略实现了卓越的车辆全局定位精度。
English Summary: The InterKey framework utilizes road intersections as landmarks for global vehicle localization by creating compact binary descriptors from point clouds and OpenStreetMap, achieving superior accuracy through cross-modal matching strategies.

Authors:Takuya Kiyokawa, Zhengtao Hu, Weiwei Wan, Kensuke Harada
Title: Soft Regrasping Tool Inspired by Jamming Gripper
Abstract:
Regrasping on fixtures is a promising approach to reduce pose uncertainty in robotic assembly, but conventional rigid fixtures lack adaptability and require dedicated designs for each part. To overcome this limitation, we propose a soft jig inspired by the jamming transition phenomenon, which can be continuously deformed to accommodate diverse object geometries. By pressing a triangular-pyramid-shaped tool into the membrane and evacuating the enclosed air, a stable cavity is formed as a placement space. We further optimize the stamping depth to balance placement stability and gripper accessibility. In soft-jig-based regrasping, the key challenge lies in optimizing the cavity size to achieve precise dropping; once the part is reliably placed, subsequent grasping can be performed with reduced uncertainty. Accordingly, we conducted drop experiments on ten mechanical parts of varying shapes, which achieved placement success rates exceeding 80% for most objects and above 90% for cylindrical ones, while failures were mainly caused by geometric constraints and membrane properties. These results demonstrate that the proposed jig enables general-purpose, accurate, and repeatable regrasping, while also clarifying its current limitations and future potential as a practical alternative to rigid fixtures in assembly automation.
中文: 该软夹具利用阻塞转变形成自适应腔体实现机器人重抓取,在多种零件上实现了超过80%的放置成功率,同时解决了稳定性与几何约束问题。
English: The proposed soft jig utilizes jamming transition to form adaptable cavities for robotic regrasping, achieving over 80% placement success across diverse parts while addressing stability and geometric limitations.

Authors:Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, Jing Shao
Title: The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
Abstract:
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.
Chinese: 本文提出了一种通过分析大型语言模型的隐藏表示来估计输入问题难度的新方法,无需生成输出标记,并在多种任务中提高了推理效率。
English: This paper introduces a novel method for estimating the difficulty of input questions for large language models by analyzing their hidden representations, which avoids output token generation and enhances inference efficiency in various tasks.

Authors:Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou
Title: FunAudio-ASR Technical Report
Abstract:
In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
Chinese: Fun-ASR 是一种基于大语言模型的大规模自动语音识别系统,通过整合海量数据、大模型能力和强化学习,在实际应用中实现了顶尖性能,有效解决了幻觉等问题,并优化了流式处理、抗噪等实用功能。
English: Fun-ASR is a large-scale, LLM-based automatic speech recognition system that integrates massive data, large models, and reinforcement learning to achieve state-of-the-art performance in real-world applications, addressing challenges like hallucination and enhancing practical features such as streaming and noise robustness.

Authors:Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou
Title: Fun-ASR Technical Report
Abstract:
In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present Fun-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
Chinese: Fun-ASR 是一种基于大语言模型的大规模自动语音识别系统,通过整合海量数据、大模型能力和强化学习,在实际应用中实现了顶尖性能,有效解决了幻觉等问题,并优化了流式处理、抗噪等实用功能。
English: Fun-ASR is a large-scale, LLM-based automatic speech recognition system that integrates massive data, large models, and reinforcement learning to achieve state-of-the-art performance in real-world applications, addressing challenges like hallucination and enhancing practical features such as streaming and noise robustness.

Authors:Liangqi Yuan, Dong-Jun Han, Christopher G. Brinton, Sabine Brunswicker
Title: LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences
Abstract:
The rise of large language models (LLMs) has made natural language-driven route planning an emerging research area that encompasses rich user objectives. Current research exhibits two distinct approaches: direct route planning using LLM-as-Agent and graph-based searching strategies. However, LLMs in the former approach struggle to handle extensive map data, while the latter shows limited capability in understanding natural language preferences. Additionally, a more critical challenge arises from the highly heterogeneous and unpredictable spatio-temporal distribution of users across the globe. In this paper, we introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an LLM-as-Parser to comprehend natural language, identify tasks, and extract user preferences and recognize task dependencies, coupled with a Multi-Step Graph construction with iterative Search (MSGS) algorithm as the underlying solver for optimal route finding. Our multi-objective optimization approach adaptively tunes objective weights to maximize points of interest (POI) quality and task completion rate while minimizing route distance, subject to three key constraints: user time limits, POI opening hours, and task dependencies. We conduct extensive experiments using 1,000 routing prompts sampled with varying complexity across 14 countries and 27 cities worldwide. The results demonstrate that our approach achieves superior performance with guarantees across multiple constraints.
中文摘要:本研究提出了一种新颖的LLM辅助路径规划系统,通过结合自然语言理解与多步骤图优化技术,有效解决了现有方法在处理地图数据和用户偏好方面的局限,在全球路径规划场景中实现了卓越性能。
English Summary: The study introduces a novel LLM-Assisted route Planning (LLMAP) system that combines natural language understanding with multi-step graph optimization to address limitations in handling map data and user preferences, achieving superior performance in global routing scenarios.

Authors:Michael Kölle, Simon Salfer, Tobias Rohe, Philipp Altmann, Claudia Linnhoff-Popien
Title: Quantum Architecture Search for Solving Quantum Machine Learning Tasks
Abstract:
Quantum computing leverages quantum mechanics to address computational problems in ways that differ fundamentally from classical approaches. While current quantum hardware remains error-prone and limited in scale, Variational Quantum Circuits offer a noise-resilient framework suitable for today's devices. The performance of these circuits strongly depends on the underlying architecture of their parameterized quantum components. Identifying efficient, hardware-compatible quantum circuit architectures -- known as Quantum Architecture Search (QAS) -- is therefore essential. Manual QAS is complex and error-prone, motivating efforts to automate it. Among various automated strategies, Reinforcement Learning (RL) remains underexplored, particularly in Quantum Machine Learning contexts. This work introduces RL-QAS, a framework that applies RL to discover effective circuit architectures for classification tasks. We evaluate RL-QAS using the Iris and binary MNIST datasets. The agent autonomously discovers low-complexity circuit designs that achieve high test accuracy. Our results show that RL is a viable approach for automated architecture search in quantum machine learning. However, applying RL-QAS to more complex tasks will require further refinement of the search strategy and performance evaluation mechanisms.
中文: 本文提出RL-QAS强化学习框架,通过自动化量子架构搜索成功设计出高效量子电路,在Iris和二元MNIST分类任务中实现高精度,验证了强化学习在量子机器学习中的可行性,同时指出需进一步优化以应对更复杂任务。
English: This paper introduces RL-QAS, a reinforcement learning framework for automated quantum architecture search, which successfully identifies efficient quantum circuit designs achieving high accuracy on classification tasks like Iris and binary MNIST datasets, demonstrating RL's viability in quantum machine learning while noting the need for further refinements for more complex applications.

Authors:Michael Kölle, Leonhard Klingert, Julian Schönberger, Philipp Altmann, Tobias Rohe, Claudia Linnhoff-Popien
Title: Investigating the Lottery Ticket Hypothesis for Variational Quantum Circuits
Abstract:
Quantum computing is an emerging field in computer science that has seen considerable progress in recent years, especially in machine learning. By harnessing the principles of quantum physics, it can surpass the limitations of classical algorithms. However, variational quantum circuits (VQCs), which rely on adjustable parameters, often face the barren plateau phenomenon, hindering optimization. The Lottery Ticket Hypothesis (LTH) is a recent concept in classical machine learning that has led to notable improvements in parameter efficiency for neural networks. It states that within a large network, a smaller, more efficient subnetwork, or ''winning ticket,'' can achieve comparable performance, potentially circumventing plateau challenges. In this work, we investigate whether this idea can apply to VQCs. We show that the weak LTH holds for VQCs, revealing winning tickets that retain just 26.0\% of the original parameters. For the strong LTH, where a pruning mask is learned without any training, we discovered a winning ticket in a binary VQC, achieving 100\% accuracy with only 45\% of the weights. These findings indicate that LTH may mitigate barren plateaus by reducing parameter counts while preserving performance, thus enhancing the efficiency of VQCs in quantum machine learning tasks.
中文摘要:研究表明彩票假设适用于变分量子电路,发现仅保留少量参数的"中奖彩票"仍能保持高性能,有望克服贫瘠高原问题并提升量子机器学习效率。
English Summary: The study demonstrates that the Lottery Ticket Hypothesis applies to variational quantum circuits, revealing winning tickets that maintain high performance with significantly fewer parameters, potentially overcoming barren plateaus and improving quantum machine learning efficiency.

Authors:Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, Xingang Wang
Title: Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization
Abstract:
Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress-process-recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.
Chinese: 本文提出VQBridge,一种基于映射函数方法的鲁棒投影器,通过压缩-处理-恢复流程优化码书训练,实现100%码书使用率,显著提升重建质量,并在与LlamaGen结合时超越现有图像生成模型。
English: This paper introduces VQBridge, a robust projector that stabilizes vector quantization training by ensuring full codebook usage, leading to superior reconstruction and enhanced image generation performance when integrated with models like LlamaGen.

Authors:Sophia Lockton, Jeremy Kepner, Michael Stonebraker, Hayden Jananthan, LaToya Anderson, William Arcand, David Bestor, William Bergeron, Alex Bonn, Daniel Burrill, Chansup Byun, Timothy Davis, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Piotr Luszczek, Peter Michaleas, Lauren Milechin, Chasen Milner, Guillermo Morales, Julie Mullen, Michel Pelletier, Alex Poliakov, Andrew Prout, Albert Reuther, Antonio Rosa, Charles Yee, Alex Pentland
Title: DBOS Network Sensing: A Web Services Approach to Collaborative Awareness
Abstract:
DBOS (DataBase Operating System) is a novel capability that integrates web services, operating system functions, and database features to significantly reduce web-deployment effort while increasing resilience. Integration of high performance network sensing enables DBOS web services to collaboratively create a shared awareness of their network environments to enhance their collective resilience and security. Network sensing is added to DBOS using GraphBLAS hypersparse traffic matrices via two approaches: (1) Python-GraphBLAS and (2) OneSparse PostgreSQL. These capabilities are demonstrated using the workflow and analytics from the IEEE/MIT/Amazon Anonymized Network Sensing Graph Challenge. The system was parallelized using pPython and benchmarked using 64 compute nodes on the MIT SuperCloud. The web request rate sustained by a single DBOS instance was ${>}10^5$, well above the required maximum, indicating that network sensing can be added to DBOS with negligible overhead. For collaborative awareness, many DBOS instances were connected to a single DBOS aggregator. The Python-GraphBLAS and OneSparse PostgreSQL implementations scaled linearly up to 64 and 32 nodes respectively. These results suggest that DBOS collaborative network awareness can be achieved with a negligible increase in computing resources.
中文摘要:DBOS通过整合网络服务、操作系统功能与数据库特性,显著简化了网络部署并提升了系统韧性;采用GraphBLAS方法实现的网络感知功能以可忽略的性能开销实现了多实例间的协同感知,且具备线性扩展能力。
English Summary: DBOS integrates web services, OS functions, and database capabilities to streamline web deployment and enhance resilience, with network sensing added via GraphBLAS methods showing negligible performance overhead while enabling scalable collaborative awareness among instances.

Authors:Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Title: Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
Abstract:
In this paper, we introduce an insightful paradigm through the Auto-Encoder lens-understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
中文: 本文提出了一种通过自编码器框架统一多模态理解与生成的方法,表明两者在重建目标下的双向强化能显著提升视觉感知和生成质量。
English: This paper introduces a unified multimodal model that bridges understanding and generation through an auto-encoder framework, demonstrating that bidirectional reinforcement between these tasks enhances both visual perception and generation fidelity.

Authors:Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
Title: Unified Multimodal Model as Auto-Encoder
Abstract:
The pursuit of unified multimodal models (UMMs) has long been hindered by a fundamental schism between multimodal understanding and generation. Current approaches typically disentangle the two and treat them as separate endeavors with disjoint objectives, missing the mutual benefits. We argue that true unification requires more than just merging two tasks. It requires a unified, foundational objective that intrinsically links them. In this paper, we introduce an insightful paradigm through the Auto-Encoder lens, i.e., regarding understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. To implement this, we propose UAE, where we begin by pre-training the decoder with the proposed 700k long-context image-caption pairs to direct it to "understand" the fine-grained and complex semantics from the text. We then propose Unified-GRPO via reinforcement learning (RL) to unify the two, which covers two complementary stages: (1) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual perception; (2) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. Our empirical results suggest that understanding can largely enhance generation (verified on GenEval), while generation, in turn, notably strengthens fine-grained visual perception like small object and color recognition (verified on MMT-Bench). This bidirectional improvement reveals a deep synergy: under the unified reconstruction objective, generation and understanding can mutually benefit each other, moving closer to truly unified multimodal intelligence.
中文: 本文提出了一种通过自编码器框架统一多模态理解与生成的方法,表明两者在重建目标下的双向强化能显著提升视觉感知和生成质量。
English: This paper introduces a unified multimodal model that bridges understanding and generation through an auto-encoder framework, demonstrating that bidirectional reinforcement between these tasks enhances both visual perception and generation fidelity.

Authors:Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, Hao Xu
Title: Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language Models
Abstract:
Large Language Models (LLMs) have transformed both everyday life and scientific research. However, adapting LLMs from general-purpose models to specialized tasks remains challenging, particularly in resource-constrained environments. Low-Rank Adaptation (LoRA), a prominent method within Parameter-Efficient Fine-Tuning (PEFT), has emerged as a promising approach to LLMs by approximating model weight updates using low-rank decomposition. However, LoRA is limited by its uniform rank ( r ) allocation to each incremental matrix, and existing rank allocation techniques aimed at addressing this issue remain computationally inefficient, complex, and unstable, hindering practical applications. To address these limitations, we propose Sensitivity-LoRA, an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on both their global and local sensitivities. It leverages the second-order derivatives (Hessian Matrix) of the loss function to effectively capture weight sensitivity, enabling optimal rank allocation with minimal computational overhead. Our experimental results have demonstrated robust effectiveness, efficiency and stability of Sensitivity-LoRA across diverse tasks and benchmarks.
中文:Sensitivity-LoRA是一种高效的微调方法,它基于权重矩阵的全局和局部敏感性动态分配秩,利用二阶导数以最小计算开销优化性能,在不同任务中展现出强大的有效性。
English: Sensitivity-LoRA is an efficient fine-tuning method that dynamically allocates ranks to weight matrices based on their global and local sensitivities, using second-order derivatives to optimize performance with minimal computational overhead, demonstrating robust effectiveness across various tasks.

Authors:Ce Guo, Xieyuanli Chen, Zhiwen Zeng, Zirui Guo, Yihong Li, Haoran Xiao, Dewen Hu, Huimin Lu
Title: Grasp Like Humans: Learning Generalizable Multi-Fingered Grasping from Human Proprioceptive Sensorimotor Integration
Abstract:
Tactile and kinesthetic perceptions are crucial for human dexterous manipulation, enabling reliable grasping of objects via proprioceptive sensorimotor integration. For robotic hands, even though acquiring such tactile and kinesthetic feedback is feasible, establishing a direct mapping from this sensory feedback to motor actions remains challenging. In this paper, we propose a novel glove-mediated tactile-kinematic perception-prediction framework for grasp skill transfer from human intuitive and natural operation to robotic execution based on imitation learning, and its effectiveness is validated through generalized grasping tasks, including those involving deformable objects. Firstly, we integrate a data glove to capture tactile and kinesthetic data at the joint level. The glove is adaptable for both human and robotic hands, allowing data collection from natural human hand demonstrations across different scenarios. It ensures consistency in the raw data format, enabling evaluation of grasping for both human and robotic hands. Secondly, we establish a unified representation of multi-modal inputs based on graph structures with polar coordinates. We explicitly integrate the morphological differences into the designed representation, enhancing the compatibility across different demonstrators and robotic hands. Furthermore, we introduce the Tactile-Kinesthetic Spatio-Temporal Graph Networks (TK-STGN), which leverage multidimensional subgraph convolutions and attention-based LSTM layers to extract spatio-temporal features from graph inputs to predict node-based states for each hand joint. These predictions are then mapped to final commands through a force-position hybrid mapping.
中文摘要:本文提出一种基于数据手套的触觉-运动感知预测框架,通过模仿学习将人类抓取技能迁移至机器人,利用图结构网络处理多模态数据,成功实现了包括可变形物体在内的通用抓取任务验证。
English Summary: This paper introduces a glove-based framework that transfers human grasp skills to robots using imitation learning, effectively handling various objects including deformable ones through integrated tactile-kinesthetic data and graph-based neural networks.

Authors:William Cashman, Chasen Milner, Michael Houle, Michael Jones, Hayden Jananthan, Jeremy Kepner, Peter Michaleas, Alex Pentland
Title: Accelerating AI Development with Cyber Arenas
Abstract:
AI development requires high fidelity testing environments to effectively transition from the laboratory to operations. The flexibility offered by cyber arenas presents a novel opportunity to test new artificial intelligence (AI) capabilities with users. Cyber arenas are designed to expose end-users to real-world situations and must rapidly incorporate evolving capabilities to meet their core objectives. To explore this concept the MIT/IEEE/Amazon Graph Challenge Anonymized Network Sensor was deployed in a cyber arena during a National Guard exercise.
中文: 网络竞技场通过在国家警卫队演习中部署MIT/IEEE/亚马逊图挑战项目,展示了其为用户测试人工智能能力提供的灵活且逼真的环境。
English: Cyber arenas provide a flexible and realistic environment for testing AI capabilities with users, as demonstrated by the deployment of the MIT/IEEE/Amazon Graph Challenge in a National Guard exercise.

Authors:Pengxiang Zhao, Guangyi Liu, Yaozhen Liang, Weiqing He, Zhengxi Lu, Yuehao Huang, Yaxuan Guo, Kexin Zhang, Hao Wang, Liang Liu, Yong Liu
Title: MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents
Abstract:
To enhance the efficiency of GUI agents on various platforms like smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., API, deep links) is emerging as a promising direction. However, a framework for systematically benchmarking these hybrid agents is still underexplored. To take the first step in bridging this gap, we introduce MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents with a specific focus on the mobile domain. Beyond merely using predefined shortcuts, MAS-Bench assesses an agent's capability to autonomously generate shortcuts by discovering and creating reusable, low-cost workflows. It features 139 complex tasks across 11 real-world applications, a knowledge base of 88 predefined shortcuts (APIs, deep-links, RPA scripts), and 7 evaluation metrics. The tasks are designed to be solvable via GUI-only operations, but can be significantly accelerated by intelligently embedding shortcuts. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than their GUI-only counterparts. This result also demonstrates the effectiveness of our method for evaluating an agent's shortcut generation capabilities. MAS-Bench fills a critical evaluation gap, providing a foundational platform for future advancements in creating more efficient and robust intelligent agents.
中文: MAS-Bench作为首个评估移动端GUI-快捷方式混合智能体的基准,通过139项复杂任务证明混合方法在成功率和效率上显著优于纯图形界面操作,填补了该领域的关键评估空白。
English: MAS-Bench introduces a pioneering benchmark for evaluating GUI-shortcut hybrid agents in mobile applications, featuring 139 complex tasks and demonstrating that hybrid agents significantly outperform GUI-only methods in success rates and efficiency.

Authors:Penelope Brown, Julie Stephany Berrio Perez, Mao Shan, Stewart Worrall
Title: Multi-Modal Camera-Based Detection of Vulnerable Road Users
Abstract:
Vulnerable road users (VRUs) such as pedestrians, cyclists, and motorcyclists represent more than half of global traffic deaths, yet their detection remains challenging in poor lighting, adverse weather, and unbalanced data sets. This paper presents a multimodal detection framework that integrates RGB and thermal infrared imaging with a fine-tuned YOLOv8 model. Training leveraged KITTI, BDD100K, and Teledyne FLIR datasets, with class re-weighting and light augmentations to improve minority-class performance and robustness, experiments show that 640-pixel resolution and partial backbone freezing optimise accuracy and efficiency, while class-weighted losses enhance recall for rare VRUs. Results highlight that thermal models achieve the highest precision, and RGB-to-thermal augmentation boosts recall, demonstrating the potential of multimodal detection to improve VRU safety at intersections.
中文: 本研究提出了一种融合RGB与热成像的多模态检测框架,通过优化YOLOv8模型在恶劣环境下提升对弱势道路使用者的检测能力,实验表明热成像模型精度最高,且跨模态增强技术显著提高了罕见目标的召回率。
English: This study introduces a multimodal framework combining RGB and thermal imaging with an optimized YOLOv8 model to enhance detection of vulnerable road users, achieving improved precision through thermal data and boosted recall via cross-modal augmentation under challenging conditions.

Authors:Hao Wen, Yifan Su, Feifei Zhang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li
Title: ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute
Abstract:
Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model's capability but a flaw in the scaling strategy itself, a phenomenon we term "Tunnel Vision", where a model's imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thoughts simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model's latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient way to superior reasoning than simply scaling sequentially (depth). On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.
中文: 该摘要介绍了ParaThinker框架,它通过并行生成多样化思维路径来克服大型语言模型顺序推理的瓶颈,从而以极低延迟实现显著准确率提升。
English: The abstract introduces ParaThinker, a framework that overcomes the bottleneck of sequential reasoning in large language models by enabling parallel generation of diverse thought paths, leading to significant accuracy gains with minimal latency.

Authors:Weizhi Chen, Ziwei Wang, Leyang Yang, Sheng Zhou, Xiaoxuan Tang, Jiajun Bu, Yong Li, Wei Jiang
Title: PG-Agent: An Agent Powered by Page Graph
Abstract:
Graphical User Interface (GUI) agents possess significant commercial and social value, and GUI agents powered by advanced multimodal large language models (MLLMs) have demonstrated remarkable potential. Currently, existing GUI agents usually utilize sequential episodes of multi-step operations across pages as the prior GUI knowledge, which fails to capture the complex transition relationship between pages, making it challenging for the agents to deeply perceive the GUI environment and generalize to new scenarios. Therefore, we design an automated pipeline to transform the sequential episodes into page graphs, which explicitly model the graph structure of the pages that are naturally connected by actions. To fully utilize the page graphs, we further introduce Retrieval-Augmented Generation (RAG) technology to effectively retrieve reliable perception guidelines of GUI from them, and a tailored multi-agent framework PG-Agent with task decomposition strategy is proposed to be injected with the guidelines so that it can generalize to unseen scenarios. Extensive experiments on various benchmarks demonstrate the effectiveness of PG-Agent, even with limited episodes for page graph construction.
中文:基于多模态大语言模型的图形用户界面代理潜力巨大,但现有方法依赖顺序操作难以捕捉页面间复杂转换关系,为此设计了将操作序列转化为页面图结构的方法,结合检索增强生成技术和多代理框架,实验证明该方案在有限数据下仍能有效泛化至新场景。
English: GUI agents leveraging multimodal large language models show great potential, but current approaches using sequential operations struggle with page transitions and generalization, prompting the development of a page graph-based framework enhanced with retrieval-augmented generation and a multi-agent system that demonstrates strong performance across benchmarks even with limited data.

Authors:Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li
Title: Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
Abstract:
Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce \textbf{Align-Then-stEer (\texttt{ATE})}, a novel, data-efficient, and plug-and-play adaptation framework. \texttt{ATE} first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to \textbf{9.8\%} in simulation and achieves a striking \textbf{32\% success rate gain} in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
中文摘要:Align-Then-stEer (ATE) 框架通过构建统一动作空间并引导生成过程,有效解决了视觉-语言-动作模型在跨平台任务中的适应难题,在仿真和现实环境中均实现了显著性能提升。
English Summary: The Align-Then-stEer (ATE) framework efficiently adapts Vision-Language-Action models to new robotic tasks by aligning action spaces and steering generation processes, achieving significant performance improvements in both simulation and real-world scenarios.

Authors:Therese Joseph, Tobias Fischer, Michael Milford
Title: Ensemble-Based Event Camera Place Recognition Under Varying Illumination
Abstract:
Compared to conventional cameras, event cameras provide a high dynamic range and low latency, offering greater robustness to rapid motion and challenging lighting conditions. Although the potential of event cameras for visual place recognition (VPR) has been established, developing robust VPR frameworks under severe illumination changes remains an open research problem. In this paper, we introduce an ensemble-based approach to event camera place recognition that combines sequence-matched results from multiple event-to-frame reconstructions, VPR feature extractors, and temporal resolutions. Unlike previous event-based ensemble methods, which only utilise temporal resolution, our broader fusion strategy delivers significantly improved robustness under varied lighting conditions (e.g., afternoon, sunset, night), achieving a 57% relative improvement in Recall@1 across day-night transitions. We evaluate our approach on two long-term driving datasets (with 8 km per traverse) without metric subsampling, thereby preserving natural variations in speed and stop duration that influence event density. We also conduct a comprehensive analysis of key design choices, including binning strategies, polarity handling, reconstruction methods, and feature extractors, to identify the most critical components for robust performance. Additionally, we propose a modification to the standard sequence matching framework that enhances performance at longer sequence lengths. To facilitate future research, we will release our codebase and benchmarking framework.
中文: 本文提出了一种基于事件相机的集成视觉位置识别方法,通过融合多种事件帧重建、特征提取器和时间分辨率,在变化光照条件下实现了Recall@1指标57%的相对提升,同时分析了关键设计要素并改进了序列匹配框架。
English: This paper introduces an ensemble-based visual place recognition method for event cameras that combines multiple event-to-frame reconstructions, feature extractors, and temporal resolutions, achieving a 57% improvement in Recall@1 under varying lighting conditions while analyzing key design components and proposing sequence matching enhancements.

Authors:Abdelrhman Werby, Martin Büchner, Adrian Röfer, Chenguang Huang, Wolfram Burgard, Abhinav Valada
Title: Articulated Object Estimation in the Wild
Abstract:
Understanding the 3D motion of articulated objects is essential in robotic scene understanding, mobile manipulation, and motion planning. Prior methods for articulation estimation have primarily focused on controlled settings, assuming either fixed camera viewpoints or direct observations of various object states, which tend to fail in more realistic unconstrained environments. In contrast, humans effortlessly infer articulation by watching others manipulate objects. Inspired by this, we introduce ArtiPoint, a novel estimation framework that can infer articulated object models under dynamic camera motion and partial observability. By combining deep point tracking with a factor graph optimization framework, ArtiPoint robustly estimates articulated part trajectories and articulation axes directly from raw RGB-D videos. To foster future research in this domain, we introduce Arti4D, the first ego-centric in-the-wild dataset that captures articulated object interactions at a scene level, accompanied by articulation labels and ground-truth camera poses. We benchmark ArtiPoint against a range of classical and learning-based baselines, demonstrating its superior performance on Arti4D. We make code and Arti4D publicly available at https://artipoint.cs.uni-freiburg.de.
Chinese: ArtiPoint是一种新颖的框架,能够在动态相机运动和部分可观测条件下从原始RGB-D视频中稳健估计铰接物体模型,并在新发布的Arti4D数据集上超越了现有方法。
English: ArtiPoint is a novel framework that robustly estimates articulated object models from raw RGB-D videos under dynamic camera motion and partial observability, outperforming existing methods on the newly introduced Arti4D dataset.

Authors:Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Chao Sun, Rongzhou Zhang, Guanyu Zhou, Lijie Wen, Xuming Hu
Title: OmniDPO: A Preference Optimization Framework to Address Omni-Modal Hallucination
Abstract:
Recently, Omni-modal large language models (OLLMs) have sparked a new wave of research, achieving impressive results in tasks such as audio-video understanding and real-time environment perception. However, hallucination issues still persist. Similar to the bimodal setting, the priors from the text modality tend to dominate, leading OLLMs to rely more heavily on textual cues while neglecting visual and audio information. In addition, fully multimodal scenarios introduce new challenges. Most existing models align visual or auditory modalities with text independently during training, while ignoring the intrinsic correlations between video and its corresponding audio. This oversight results in hallucinations when reasoning requires interpreting hidden audio cues embedded in video content. To address these challenges, we propose OmniDPO, a preference-alignment framework designed to mitigate hallucinations in OLLMs. Specifically, OmniDPO incorporates two strategies: (1) constructing text-preference sample pairs to enhance the model's understanding of audio-video interactions; and (2) constructing multimodal-preference sample pairs to strengthen the model's attention to visual and auditory information. By tackling both challenges, OmniDPO effectively improves multimodal grounding and reduces hallucination. Experiments conducted on two OLLMs demonstrate that OmniDPO not only effectively mitigates multimodal hallucinations but also significantly enhances the models' reasoning capabilities across modalities. All code and datasets will be released upon paper acceptance.
中文: OmniDPO是一个偏好对齐框架,通过增强音频-视频交互理解和强化对视觉与听觉信息的关注,有效缓解全模态大语言模型中的幻觉问题,从而提升多模态基础与跨模态推理能力。
English: OmniDPO is a preference-alignment framework that addresses hallucination issues in Omni-modal large language models by enhancing audio-video interaction understanding and strengthening attention to visual and auditory information, thereby improving multimodal grounding and reasoning capabilities.

Authors:Terry Jingchen Zhang, Gopal Dev, Ning Wang, Nicole Ni, Wenyuan Jiang, Yinya Huang, Bernhard Schölkopf, Mrinmaya Sachan, Zhijing Jin
Title: Beyond Memorization: Reasoning-Driven Synthesis as a Mitigation Strategy Against Benchmark Contamination
Abstract:
Capability evaluation of large language models (LLMs) is increasingly shadowed by rising concerns of data contamination that cast doubts on whether static benchmarks measure genuine reasoning or mere memorization. We present an empirical study using an infinitely scalable framework to synthesize research-level QA directly from arXiv papers, harnessing the natural temporal structure of research publications where performance decay after knowledge cutoffs may indicate potential contamination. We evaluated 4 frontier model represented by 2 models of different knowledge cutoff dates per family on 1,643 multi-step reasoning questions synthesized from 20,277 arXiv papers stratified over 26 months, covering at least 6 months before and after all cutoff dates. Our results consistently showed a lack of significant performance decay near knowledge cutoff dates for models of various sizes, developers, and release dates. We further performed a comparative analysis with previous longitudinal studies that reported significant post-cutoff performance decay using directly retrieved questions based on public data. we hypothesize that the multi-step reasoning required by our synthesis pipeline offered additional complexity that goes deeper than shallow memorization, which effectively serves a mitigation strategy against benchmark contamination. We fully open source our code and dataset to aid reproducibility and advocate for a paradigm shift that prioritize reasoning-driven synthesis to construct benchmarks over simply collecting newly released questions periodically.
中文: 本研究提出一种可扩展框架,通过arXiv论文生成研究级问题,发现大型语言模型在知识截止日期附近未出现显著性能衰退,表明多步推理能超越单纯记忆,有效缓解基准测试污染。
English: This study introduces a scalable framework to generate research-level questions from arXiv papers, finding no significant performance decay in large language models near their knowledge cutoff dates, which suggests that multi-step reasoning mitigates benchmark contamination by transcending mere memorization.

Authors:Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, Jingdong Wang
Title: Query-Kontext: An Unified Multimodal Model for Image Generation and Editing
Abstract:
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I), whether instantiated as assembled unified frameworks which couple powerful vision-language model (VLM) with diffusion-based generator, or as naive Unified Multimodal Models with an early fusion of understanding and generation modalities. We contend that in current unified frameworks, the crucial capability of multimodal generative reasoning which encompasses instruction understanding, grounding, and image referring for identity preservation and faithful reconstruction, is intrinsically entangled with high-fidelity synthesis. In this work, we introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal ``kontext'' composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. This design delegates the complex ability of multimodal generative reasoning to powerful VLM while reserving diffusion model's role for high-quality visual synthesis. To achieve this, we propose a three-stage progressive training strategy. First, we connect the VLM to a lightweight diffusion head via multimodal kontext tokens to unleash the VLM's generative reasoning ability. Second, we scale this head to a large, pre-trained diffusion model to enhance visual detail and realism. Finally, we introduce a low-level image encoder to improve image fidelity and perform instruction tuning on downstream tasks. Furthermore, we build a comprehensive data pipeline integrating real, synthetic, and open-source datasets, covering diverse multimodal reference-to-image scenarios, including image generation, instruction-driven editing, customized generation, and multi-subject composition. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
中文: 本文提出Query-Kontext方法,通过语义线索和粗粒度图像条件构建的多模态"上下文"连接视觉语言模型与扩散模型,将生成推理能力与视觉合成解耦,在多种图像生成与编辑任务中达到与先进方法相当或更优的性能。
English: This paper introduces Query-Kontext, a novel approach that decouples multimodal generative reasoning from visual synthesis by bridging vision-language models with diffusion models through semantic and coarse-grained image contexts, achieving competitive or superior performance across diverse image generation and editing tasks.

Authors:Yanbin Fu, Hong Jiao, Tianyi Zhou, Robert W. Lissitz, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters
Title: Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests
Abstract:
Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at both domain and skill levels respectively with 10 skills mapped to 4 content domains. The model performance was evaluated in multiple criteria on two testing datasets. The impact of types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size increase alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual-E5-large-instruct model. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models, particularly for the more fine-grained skill alignment. To better understand model misclassifications, multiple semantic similarity analysis including pairwise cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimension projections of item embeddings were conducted. These analyses consistently showed that certain skills in SAT and PSAT were semantically too close, providing evidence for the observed misclassification.
中文: 本研究证明,通过利用全面的题目文本数据,经过微调的小型语言模型能有效实现测试题目自动对齐,其性能优于基于嵌入的模型,并通过详细的错误分类分析揭示了部分技能间的语义重叠问题。
English: This study demonstrates that fine-tuned small language models effectively automate test item alignment, outperforming embedding-based models by leveraging comprehensive item text data and revealing semantic overlaps between certain skills through detailed misclassification analysis.

Authors:Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, Limin Wang
Title: MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
Abstract:
Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline extracting high-level motion features using video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.
中文摘要:MotionRAG通过检索参考视频中的运动先验并采用上下文感知运动适配技术,以模块化设计提升视频生成的运动真实感,在推理时仅需极低计算开销即可实现跨领域零样本泛化。
English Summary: MotionRAG enhances video generation realism by retrieving and adapting motion priors from reference videos through a modular framework that integrates motion features into diffusion models with minimal computational cost.

Authors:Zichao Shen, Chen Gao, Jiaqi Yuan, Tianchen Zhu, Xingcheng Fu, Qingyun Sun
Title: SDA-PLANNER: State-Dependency Aware Adaptive Planner for Embodied Task Planning
Abstract:
Embodied task planning requires agents to produce executable actions in a close-loop manner within the environment. With progressively improving capabilities of LLMs in task decomposition, planning, and generalization, current embodied task planning methods adopt LLM-based architecture.However, existing LLM-based planners remain limited in three aspects, i.e., fixed planning paradigms, lack of action sequence constraints, and error-agnostic. In this work, we propose SDA-PLANNER, enabling an adaptive planning paradigm, state-dependency aware and error-aware mechanisms for comprehensive embodied task planning. Specifically, SDA-PLANNER introduces a State-Dependency Graph to explicitly model action preconditions and effects, guiding the dynamic revision. To handle execution error, it employs an error-adaptive replanning strategy consisting of Error Backtrack and Diagnosis and Adaptive Action SubTree Generation, which locally reconstructs the affected portion of the plan based on the current environment state. Experiments demonstrate that SDA-PLANNER consistently outperforms baselines in success rate and goal completion, particularly under diverse error conditions.
中文: SDA-PLANNER提出了一种具有状态依赖感知和错误自适应重规划能力的具身任务规划框架,在处理执行错误和提高任务成功率方面显著优于现有方法。
English: SDA-PLANNER introduces an adaptive embodied task planning framework with state-dependency awareness and error-adaptive replanning, significantly outperforming existing methods in handling execution errors and improving task success rates.

Authors:Ruiqi Luo, Ran Jin, Zhenglong Li, Kaixi Hu, Xiaohui Tao, Lin Li
Title: HiFIRec: Towards High-Frequency yet Low-Intention Behaviors for Multi-Behavior Recommendation
Abstract:
Multi-behavior recommendation leverages multiple types of user-item interactions to address data sparsity and cold-start issues, providing personalized services in domains such as healthcare and e-commerce. Most existing methods utilize graph neural networks to model user intention in a unified manner, which inadequately considers the heterogeneity across different behaviors. Especially, high-frequency yet low-intention behaviors may implicitly contain noisy signals, and frequent patterns that are plausible while misleading, thereby hindering the learning of user intentions. To this end, this paper proposes a novel multi-behavior recommendation method, HiFIRec, that corrects the effect of high-frequency yet low-intention behaviors by differential behavior modeling. To revise the noisy signals, we hierarchically suppress it across layers by extracting neighborhood information through layer-wise neighborhood aggregation and further capturing user intentions through adaptive cross-layer feature fusion. To correct plausible frequent patterns, we propose an intensity-aware non-sampling strategy that dynamically adjusts the weights of negative samples. Extensive experiments on two benchmarks show that HiFIRec relatively improves HR@10 by 4.21%-6.81% over several state-of-the-art methods.
中文摘要:HiFIRec是一种新型多行为推荐方法,通过差异行为建模、分层噪声抑制和强度感知非采样策略,有效修正高频低意图行为中的噪声信号和误导性频繁模式,在基准测试中相对现有最优方法提升了4.21%-6.81%的HR@10指标。
English Summary: HiFIRec is a novel multi-behavior recommendation method that addresses noisy signals from high-frequency but low-intention behaviors through differential behavior modeling, hierarchical noise suppression, and an intensity-aware non-sampling strategy, achieving significant performance improvements over existing methods.

Authors:Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
Title: RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Abstract:
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
中文:本文提出RADAR多智能体协作框架,通过将风险解构为显性、隐性和非风险子空间重构安全评估范式,并借助动态辩论机制在基准测试中实现风险识别准确率28.87%的提升。
English: This paper introduces RADAR, a multi-agent collaborative framework that redefines LLM safety evaluation by decomposing risks into explicit, implicit, and non-risk subspaces, achieving a 28.87% accuracy improvement over baselines through dynamic debate mechanisms.

Authors:Hao Chen, Tao Han, Jie Zhang, Song Guo, Lei Bai
Title: STCast: Adaptive Boundary Alignment for Global and Regional Weather Forecasting
Abstract:
To gain finer regional forecasts, many works have explored the regional integration from the global atmosphere, e.g., by solving boundary equations in physics-based methods or cropping regions from global forecasts in data-driven methods. However, the effectiveness of these methods is often constrained by static and imprecise regional boundaries, resulting in poor generalization ability. To address this issue, we propose Spatial-Temporal Weather Forecasting (STCast), a novel AI-driven framework for adaptive regional boundary optimization and dynamic monthly forecast allocation. Specifically, our approach employs a Spatial-Aligned Attention (SAA) mechanism, which aligns global and regional spatial distributions to initialize boundaries and adaptively refines them based on attention-derived alignment patterns. Furthermore, we design a Temporal Mixture-of-Experts (TMoE) module, where atmospheric variables from distinct months are dynamically routed to specialized experts using a discrete Gaussian distribution, enhancing the model's ability to capture temporal patterns. Beyond global and regional forecasting, we evaluate our STCast on extreme event prediction and ensemble forecasting. Experimental results demonstrate consistent superiority over state-of-the-art methods across all four tasks.
中文摘要:提出的STCast框架通过空间对齐注意力自适应优化区域边界,并利用时间混合专家动态分配月度预报,在四项任务中均优于现有最优方法,提升了区域天气预报的精准度。
English Summary: The proposed STCast framework enhances regional weather forecasting by adaptively optimizing boundaries with spatial-aligned attention and dynamically allocating monthly forecasts through temporal mixture-of-experts, consistently outperforming state-of-the-art methods across multiple tasks.

Authors:Junwei Lan, Jianlyu Chen, Zheng Liu, Chaofan Li, Siqi Bao, Defu Lian
Title: Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval
Abstract:
With the growing popularity of LLM agents and RAG, it has become increasingly important to retrieve documents that are essential for solving a task, even when their connection to the task is indirect or implicit. Addressing this problem requires fine-grained reasoning to accurately assess the relevance between the task and each candidate document. This capability, however, poses a significant challenge for existing IR techniques. Despite recent progress in reasoning-enhanced IR, existing approaches still face significant challenges in applicability, scalability, and efficiency. In this work, we propose Retro*, a novel approach for reasoning-intensive document retrieval. Our method introduces a rubric-based relevance scoring mechanism, enabling the model to reason about the relationship between a task and a document based on explicitly defined criteria, whereby producing a fine-grained, interpretable relevance score. Retro* also supports test-time scaling by combining multiple reasoning trajectories via score integration, which produces more reliable relevance estimates. To optimize Retro*'s reasoning capabilities, we introduce a novel reinforcement learning algorithm tailored for its relevance scoring mechanism, which employs two composite rewards to fully exploit the trajectories of each training sample. Our experiments show that Retro* outperforms existing document retrieval methods with notable advantages, leading to state-of-the-art performance on the BRIGHT benchmark.
中文: Retro*通过引入基于规则的评分机制和强化学习,提升了文档检索中的细粒度推理能力,在BRIGHT基准测试中实现了最先进的性能,提高了相关性评估的准确性和可扩展性。
English: Retro* introduces a rubric-based scoring mechanism and reinforcement learning to enhance fine-grained reasoning for document retrieval, achieving state-of-the-art performance on the BRIGHT benchmark by improving relevance assessment and scalability.

Authors:Yunyao Zhang, Xinglang Zhang, Junxi Sheng, Wenbing Li, Junqing Yu, Wei Yang, Zikai Song
Title: From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning
Abstract:
Logical reasoning is a fundamental capability of large language models (LLMs). However, existing studies largely overlook the interplay between logical complexity and semantic complexity, resulting in methods that struggle to address challenging scenarios involving abstract propositions, ambiguous contexts, and conflicting stances, which are central to human reasoning. For this gap, we propose LogicAgent, a semiotic-square-guided framework designed to jointly address logical complexity and semantic complexity. LogicAgent explicitly performs multi-perspective deduction in first-order logic (FOL), while mitigating vacuous reasoning through existential import checks that incorporate a three-valued decision scheme (True, False, Uncertain) to handle boundary cases more faithfully. Furthermore, to overcome the semantic simplicity and low logical complexity of existing datasets, we introduce RepublicQA, a benchmark that reaches college-level difficulty (FKGL = 11.94) and exhibits substantially greater lexical and structural diversity than prior benchmarks. RepublicQA is grounded in philosophical concepts, featuring abstract propositions and systematically organized contrary and contradictory relations, making it the most semantically rich resource for evaluating logical reasoning. Experiments demonstrate that LogicAgent achieves state-of-the-art performance on RepublicQA, with a 6.25% average gain over strong baselines, and generalizes effectively to mainstream logical reasoning benchmarks including ProntoQA, ProofWriter, FOLIO, and ProverQA, achieving an additional 7.05% average gain. These results highlight the strong effectiveness of our semiotic-grounded multi-perspective reasoning in boosting LLMs' logical performance.
中文: 该研究提出了LogicAgent框架,通过多视角推理和存在性检查应对逻辑与语义复杂性,并创建了高难度基准RepublicQA,在多个数据集上实现了最先进性能且显著提升。
English: The study introduces LogicAgent, a framework that addresses both logical and semantic complexities through multi-perspective reasoning and existential import checks, and RepublicQA, a challenging benchmark, achieving state-of-the-art performance with significant gains across multiple datasets.

Authors:Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
Title: Watermarking Diffusion Language Models
Abstract:
We introduce the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially. While there has been much work in ARLM watermarking, a key challenge when attempting to apply these schemes directly to the DLM setting is that they rely on previously generated tokens, which are not always available with DLM generation. In this work we address this challenge by: (i) applying the watermark in expectation over the context even when some context tokens are yet to be determined, and (ii) promoting tokens which increase the watermark strength when used as context for other tokens. This is accomplished while keeping the watermark detector unchanged. Our experimental evaluation demonstrates that the DLM watermark leads to a >99% true positive rate with minimal quality impact and achieves similar robustness to existing ARLM watermarks, enabling for the first time reliable DLM watermarking.
Chinese Summary: 本文针对扩散语言模型首次提出定制水印方案,通过在不完整上下文中应用期望水印并增强令牌的水印强度,实现了超过99%的检测准确率且对生成质量影响极小。
English Summary: This paper introduces the first watermark specifically designed for diffusion language models, which generates tokens in arbitrary order, by applying watermarks in expectation over incomplete contexts and promoting tokens that enhance watermark strength, achieving over 99% detection accuracy with minimal quality impact.

Authors:Yuanyi Wang, Yanggan Gu, Yiming Zhang, Qi Zhou, Zhaoyi Yan, Congkai Xie, Xinyao Wang, Jianbo Yuan, Hongxia Yang
Title: Model Merging Scaling Laws in Large Language Models
Abstract:
We study empirical scaling laws for language model merging measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and expert number: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, TA, TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget--turning merging from heuristic practice into a computationally efficient, planable alternative to multitask training. This suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
This research identifies a power law governing language model merging that predicts performance gains from scaling model size and expert numbers, enabling predictive planning for efficient model composition as an alternative to multitask training.
English Summary:

Authors:Bingyang Cui, Yujie Zhang, Qi Yang, Zhu Li, Yiling Xu
Title: Towards Fine-Grained Text-to-3D Quality Assessment: A Benchmark and A Two-Stage Rank-Learning Metric
Abstract:
Recent advances in Text-to-3D (T23D) generative models have enabled the synthesis of diverse, high-fidelity 3D assets from textual prompts. However, existing challenges restrict the development of reliable T23D quality assessment (T23DQA). First, existing benchmarks are outdated, fragmented, and coarse-grained, making fine-grained metric training infeasible. Moreover, current objective metrics exhibit inherent design limitations, resulting in non-representative feature extraction and diminished metric robustness. To address these limitations, we introduce T23D-CompBench, a comprehensive benchmark for compositional T23D generation. We define five components with twelve sub-components for compositional prompts, which are used to generate 3,600 textured meshes from ten state-of-the-art generative models. A large-scale subjective experiment is conducted to collect 129,600 reliable human ratings across different perspectives. Based on T23D-CompBench, we further propose Rank2Score, an effective evaluator with two-stage training for T23DQA. Rank2Score enhances pairwise training via supervised contrastive regression and curriculum learning in the first stage, and subsequently refines predictions using mean opinion scores to achieve closer alignment with human judgments in the second stage. Extensive experiments and downstream applications demonstrate that Rank2Score consistently outperforms existing metrics across multiple dimensions and can additionally serve as a reward function to optimize generative models. The project is available at https://cbysjtu.github.io/Rank2Score/.
中文: 针对文本到3D模型质量评估中基准过时和指标局限的问题,本研究提出了T23D-CompBench综合基准和Rank2Score评估器,通过两阶段训练显著提升了与人类判断的一致性。
English: Recent advances in Text-to-3D models face challenges in quality assessment due to outdated benchmarks and limited metrics, prompting the introduction of T23D-CompBench and Rank2Score, a novel evaluator that enhances alignment with human judgments through two-stage training.

Authors:Junyu Wang, Zizhen Lin, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
Title: LORT: Locally Refined Convolution and Taylor Transformer for Monaural Speech Enhancement
Abstract:
Achieving superior enhancement performance while maintaining a low parameter count and computational complexity remains a challenge in the field of speech enhancement. In this paper, we introduce LORT, a novel architecture that integrates spatial-channel enhanced Taylor Transformer and locally refined convolution for efficient and robust speech enhancement. We propose a Taylor multi-head self-attention (T-MSA) module enhanced with spatial-channel enhancement attention (SCEA), designed to facilitate inter-channel information exchange and alleviate the spatial attention limitations inherent in Taylor-based Transformers. To complement global modeling, we further present a locally refined convolution (LRC) block that integrates convolutional feed-forward layers, time-frequency dense local convolutions, and gated units to capture fine-grained local details. Built upon a U-Net-like encoder-decoder structure with only 16 output channels in the encoder, LORT processes noisy inputs through multi-resolution T-MSA modules using alternating downsampling and upsampling operations. The enhanced magnitude and phase spectra are decoded independently and optimized through a composite loss function that jointly considers magnitude, complex, phase, discriminator, and consistency objectives. Experimental results on the VCTK+DEMAND and DNS Challenge datasets demonstrate that LORT achieves competitive or superior performance to state-of-the-art (SOTA) models with only 0.96M parameters, highlighting its effectiveness for real-world speech enhancement applications with limited computational resources.
中文: LORT是一种结合空间通道增强泰勒变换器和局部优化卷积的新型语音增强架构,仅用0.96M参数即可实现业界领先性能,适用于计算资源受限的实际应用场景。
English: LORT is a novel speech enhancement architecture combining spatial-channel enhanced Taylor Transformer and locally refined convolution, achieving state-of-the-art performance with only 0.96M parameters for efficient real-world applications.

Authors:Ibne Farabi Shihab, Weiheng Chai, Jiyang Wang, Sanjeda Akter, Senem Velipasalar Gursoy, Anuj Sharma
Title: Calibrated and Resource-Aware Super-Resolution for Reliable Driver Behavior Analysis
Abstract:
Driver monitoring systems require not just high accuracy but reliable, well-calibrated confidence scores for safety-critical deployment. While direct low-resolution training yields high overall accuracy, it produces poorly calibrated predictions that can be dangerous in safety-critical scenarios. We propose a resource-aware adaptive super-resolution framework that optimizes for model calibration and high precision-recall on critical events. Our approach achieves state-of-the-art performance on safety-centric metrics: best calibration (ECE of 5.8\% vs 6.2\% for LR-trained baselines), highest AUPR for drowsiness detection (0.78 vs 0.74), and superior precision-recall for phone use detection (0.74 vs 0.71). A lightweight artifact detector (0.3M parameters, 5.2ms overhead) provides additional safety by filtering SR-induced hallucinations. While LR-trained video models serve as strong general-purpose baselines, our adaptive framework represents the state-of-the-art solution for safety-critical applications where reliability is paramount.
中文: 该自适应超分辨率框架通过优化模型校准和关键安全事件的精确召回,提升了驾驶员监控系统的可靠性,在校准和检测指标上达到最优性能,并采用轻量级伪影检测器增强安全保障。
English: The proposed adaptive super-resolution framework enhances driver monitoring systems by optimizing model calibration and precision-recall for critical safety events, achieving state-of-the-art performance in calibration and detection metrics while incorporating a lightweight artifact detector for additional safety.

Authors:Sydney Peters, Nan Zhang, Hong Jiao, Ming Li, Tianyi Zhou, Robert Lissitz
Title: Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review
Abstract:
Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and classical test theory (CTT)-based item analysis or item response theory (IRT) calibration, which can be time-consuming and costly. To overcome these challenges, text-based approaches leveraging machine learning and language models, have emerged as promising alternatives. This paper reviews and synthesizes 37 articles on automated item difficulty prediction in large-scale assessment settings published through May 2025. For each study, we delineate the dataset, difficulty parameter, subject domain, item type, number of items, training and test data split, input, features, model, evaluation criteria, and model performance outcomes. Results showed that although classic machine learning models remain relevant due to their interpretability, state-of-the-art language models, using both small and large transformer-based architectures, can capture syntactic and semantic patterns without the need for manual feature engineering. Uniquely, model performance outcomes were summarized to serve as a benchmark for future research and overall, text-based methods have the potential to predict item difficulty with root mean square error (RMSE) as low as 0.165, Pearson correlation as high as 0.87, and accuracy as high as 0.806. The review concludes by discussing implications for practice and outlining future research directions for automated item difficulty modeling.
中文摘要:本文综述了37项关于自动题目难度预测的研究,发现尽管经典机器学习模型具有可解释性优势,但先进语言模型能有效捕捉语言特征,并实现低至0.165的均方根误差等优异性能指标。
English summary: This paper reviews 37 studies on automated item difficulty prediction, finding that while classic machine learning models offer interpretability, advanced language models can effectively capture linguistic patterns and achieve strong performance metrics like RMSE as low as 0.165.

Authors:Wenhao Yang, Lin Li, Xiaohui Tao, Kaize Shi
Title: Factor Decorrelation Enhanced Data Removal from Deep Predictive Models
Abstract:
The imperative of user privacy protection and regulatory compliance necessitates sensitive data removal in model training, yet this process often induces distributional shifts that undermine model performance-particularly in out-of-distribution (OOD) scenarios. We propose a novel data removal approach that enhances deep predictive models through factor decorrelation and loss perturbation. Our approach introduces: (1) a discriminative-preserving factor decorrelation module employing dynamic adaptive weight adjustment and iterative representation updating to reduce feature redundancy and minimize inter-feature correlations. (2) a smoothed data removal mechanism with loss perturbation that creates information-theoretic safeguards against data leakage during removal operations. Extensive experiments on five benchmark datasets show that our approach outperforms other baselines and consistently achieves high predictive accuracy and robustness even under significant distribution shifts. The results highlight its superior efficiency and adaptability in both in-distribution and out-of-distribution scenarios.
中文: 本文提出一种新颖的数据移除方法,通过因子解相关和损失扰动增强模型鲁棒性与预测精度,在保证隐私合规的同时有效应对分布偏移问题。
English: This paper introduces a novel data removal method that enhances model robustness and predictive accuracy through factor decorrelation and loss perturbation, effectively addressing distribution shifts while maintaining privacy compliance.

Authors:Valentyn Melnychuk, Stefan Feuerriegel
Title: GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes
Abstract:
Various deep generative models have been proposed to estimate potential outcomes distributions from observational data. However, none of them have the favorable theoretical property of general Neyman-orthogonality and, associated with it, quasi-oracle efficiency and double robustness. In this paper, we introduce a general suite of generative Neyman-orthogonal (doubly-robust) learners that estimate the conditional distributions of potential outcomes. Our proposed GDR-learners are flexible and can be instantiated with many state-of-the-art deep generative models. In particular, we develop GDR-learners based on (a) conditional normalizing flows (which we call GDR-CNFs), (b) conditional generative adversarial networks (GDR-CGANs), (c) conditional variational autoencoders (GDR-CVAEs), and (d) conditional diffusion models (GDR-CDMs). Unlike the existing methods, our GDR-learners possess the properties of quasi-oracle efficiency and rate double robustness, and are thus asymptotically optimal. In a series of (semi-)synthetic experiments, we demonstrate that our GDR-learners are very effective and outperform the existing methods in estimating the conditional distributions of potential outcomes.
中文摘要:本文提出了一套生成式Neyman正交学习器,具备准预言机效率和双重鲁棒性,能有效估计潜在结果的条件分布,在实验中优于现有方法。
English Summary: This paper introduces a suite of generative Neyman-orthogonal learners that provide quasi-oracle efficiency and double robustness for estimating potential outcomes distributions, outperforming existing methods in experiments.

Authors:Thalea Schlender, Catharina J. A. Romme, Yvette M. van der Linden, Luc R. C. W. van Lonkhuijzen, Peter A. N. Bosman, Tanja Alderliesten
Title: PISA: An AI Pipeline for Interpretable-by-design Survival Analysis Providing Multiple Complexity-Accuracy Trade-off Models
Abstract:
Survival analysis is central to clinical research, informing patient prognoses, guiding treatment decisions, and optimising resource allocation. Accurate time-to-event predictions not only improve quality of life but also reveal risk factors that shape clinical practice. For these models to be relevant in healthcare, interpretability is critical: predictions must be traceable to patient-specific characteristics, and risk factors should be identifiable to generate actionable insights for both clinicians and researchers. Traditional survival models often fail to capture non-linear interactions, while modern deep learning approaches, though powerful, are limited by poor interpretability. We propose a Pipeline for Interpretable Survival Analysis (PISA) - a pipeline that provides multiple survival analysis models that trade off complexity and performance. Using multiple-feature, multi-objective feature engineering, PISA transforms patient characteristics and time-to-event data into multiple survival analysis models, providing valuable insights into the survival prediction task. Crucially, every model is converted into simple patient stratification flowcharts supported by Kaplan-Meier curves, whilst not compromising on performance. While PISA is model-agnostic, we illustrate its flexibility through applications of Cox regression and shallow survival trees, the latter avoiding proportional hazards assumptions. Applied to two clinical benchmark datasets, PISA produced interpretable survival models and intuitive stratification flowcharts whilst achieving state-of-the-art performances. Revisiting a prior departmental study further demonstrated its capacity to automate survival analysis workflows in real-world clinical research.
中文: 生存分析在临床研究中至关重要,用于预测患者预后和指导治疗决策,而提出的可解释生存分析流程(PISA)提供了一种灵活、模型无关的方法,在平衡复杂性和性能的同时,通过直观的患者分层流程图和先进的结果确保可解释性。
English: Survival analysis is crucial in clinical research for predicting patient outcomes and guiding treatments, and the proposed Pipeline for Interpretable Survival Analysis (PISA) offers a flexible, model-agnostic approach that balances complexity and performance while ensuring interpretability through intuitive patient stratification flowcharts and state-of-the-art results.

Authors:Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao
Title: DiTraj: training-free trajectory control for video diffusion transformer
Abstract:
Diffusion Transformers (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object's trajectory, we propose foreground-background separation guidance: we use the Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only foreground tokens' position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.
Chinese: DiTraj是一种无需训练的框架,通过前景-背景分离引导和采用STD-RoPE调整位置嵌入,增强了基于DiT的视频生成中的轨迹控制能力,有效提升了跨帧注意力和空间一致性。
English: DiTraj is a training-free framework that enhances trajectory control in DiT-based video generation by employing foreground-background separation guidance and modifying position embeddings with STD-RoPE to improve cross-frame attention and spatial consistency.

Authors:Mafalda Malafaia, Peter A. N. Bosman, Coen Rasch, Tanja Alderliesten
Title: Automated and Interpretable Survival Analysis from Multimodal Data
Abstract:
Accurate and interpretable survival analysis remains a core challenge in oncology. With growing multimodal data and the clinical need for transparent models to support validation and trust, this challenge increases in complexity. We propose an interpretable multimodal AI framework to automate survival analysis by integrating clinical variables and computed tomography imaging. Our MultiFIX-based framework uses deep learning to infer survival-relevant features that are further explained: imaging features are interpreted via Grad-CAM, while clinical variables are modeled as symbolic expressions through genetic programming. Risk estimation employs a transparent Cox regression, enabling stratification into groups with distinct survival outcomes. Using the open-source RADCURE dataset for head and neck cancer, MultiFIX achieves a C-index of 0.838 (prediction) and 0.826 (stratification), outperforming the clinical and academic baseline approaches and aligning with known prognostic markers. These results highlight the promise of interpretable multimodal AI for precision oncology with MultiFIX.
Chinese: 提出的MultiFIX框架通过可解释人工智能整合临床与影像数据,在头颈癌数据验证中实现了更优的预测精度和透明风险分层,为精准肿瘤学提供自动化生存分析解决方案。
English: The proposed MultiFIX framework integrates clinical and imaging data through interpretable AI to automate survival analysis, achieving superior predictive accuracy and transparent risk stratification validated on head and neck cancer data.

Authors:Baiqiang Wang, Qian Lou, Mengxin Zheng, Dongfang Zhao
Title: PIR-RAG: A System for Private Information Retrieval in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) has become a foundational component of modern AI systems, yet it introduces significant privacy risks by exposing user queries to service providers. To address this, we introduce PIR-RAG, a practical system for privacy-preserving RAG. PIR-RAG employs a novel architecture that uses coarse-grained semantic clustering to prune the search space, combined with a fast, lattice-based Private Information Retrieval (PIR) protocol. This design allows for the efficient retrieval of entire document clusters, uniquely optimizing for the end-to-end RAG workflow where full document content is required. Our comprehensive evaluation against strong baseline architectures, including graph-based PIR and Tiptoe-style private scoring, demonstrates PIR-RAG's scalability and its superior performance in terms of "RAG-Ready Latency"-the true end-to-end time required to securely fetch content for an LLM. Our work establishes PIR-RAG as a viable and highly efficient solution for privacy in large-scale AI systems.
中文: 检索增强生成(RAG)存在泄露用户查询的隐私风险,因此提出PIR-RAG系统,通过语义聚类和私有信息检索技术,在AI工作流中实现安全高效的文档内容获取。
English: Retrieval-Augmented Generation (RAG) poses privacy risks by exposing user queries, so PIR-RAG is introduced as a practical system using semantic clustering and Private Information Retrieval to securely fetch document content efficiently for AI workflows.

Authors:Sarah Seifi, Anass Ibrahimi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille
Title: GenFacts-Generative Counterfactual Explanations for Multi-Variate Time Series
Abstract:
Counterfactual explanations aim to enhance model transparency by showing how inputs can be minimally altered to change predictions. For multivariate time series, existing methods often generate counterfactuals that are invalid, implausible, or unintuitive. We introduce GenFacts, a generative framework based on a class-discriminative variational autoencoder. It integrates contrastive and classification-consistency objectives, prototype-based initialization, and realism-constrained optimization. We evaluate GenFacts on radar gesture data as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts outperforms state-of-the-art baselines in plausibility (+18.7%) and achieves the highest interpretability scores in a human study. These results highlight that plausibility and user-centered interpretability, rather than sparsity alone, are key to actionable counterfactuals in time series data.
中文: GenFacts是一种新颖的生成框架,可为时间序列分类器生成合理且可操作的对抗性解释,在真实性和用户可解释性方面显著优于基线方法。
English: GenFacts is a generative framework that creates plausible and interpretable counterfactual explanations for time series classifiers, demonstrating superior performance in plausibility metrics and user interpretability over baseline methods.

Authors:Sarah Seifi, Anass Ibrahimi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille
Title: GenFacts-Generative Counterfactual Explanations for Multi-Variate Time Series
Abstract:
Counterfactual explanations aim to enhance model transparency by illustrating how input modifications can change model predictions. In the multivariate time series domain, existing approaches often produce counterfactuals that lack validity, plausibility, or intuitive interpretability. We present \textbf{GenFacts}, a novel generative framework for producing plausible and actionable counterfactual explanations for time series classifiers. GenFacts introduces a structured approach to latent space modeling and targeted counterfactual synthesis. We evaluate GenFacts on radar gesture recognition as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts consistently outperforms baseline methods in plausibility metrics (+18.7\%) and achieves the highest interpretability scores in user studies. These results underscore that realism and user-centered interpretability, rather than sparsity alone, are vital for actionable counterfactuals in time series applications.
中文: GenFacts是一种新颖的生成框架,可为时间序列分类器生成合理且可操作的对抗性解释,在真实性和用户可解释性方面显著优于基线方法。
English: GenFacts is a generative framework that creates plausible and interpretable counterfactual explanations for time series classifiers, demonstrating superior performance in plausibility metrics and user interpretability over baseline methods.

Authors:Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
Title: Seedream 4.0: Toward Next-generation Multimodal Image Generation
Abstract:
We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.
Chinese: Seedream 4.0 是一款高效的多模态图像生成系统,在统一框架内整合了文生图、图像编辑和多图合成功能,通过优化训练和快速生成高分辨率图像实现了顶尖性能。
English: Seedream 4.0 is an advanced multimodal image generation system that unifies text-to-image synthesis, image editing, and multi-image composition in a single framework, achieving state-of-the-art performance through efficient training and fast high-resolution output.

Authors:Team Seedream, :, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, Xiaowen Jian, Huafeng Kuang, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, Wei Liu, Yanzuo Lu, Zhengxiong Luo, Tongtong Ou, Guang Shi, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Rui Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Wenxu Wu, Yonghui Wu, Xin Xia, Xuefeng Xiao, Shuang Xu, Xin Yan, Ceyuan Yang, Jianchao Yang, Zhonghua Zhai, Chenlin Zhang, Heng Zhang, Qi Zhang, Xinyu Zhang, Yuwei Zhang, Shijia Zhao, Wenliang Zhao, Wenjia Zhu
Title: Seedream 4.0: Toward Next-generation Multimodal Image Generation
Abstract:
We introduce Seedream 4.0, an efficient and high-performance multimodal image generation system that unifies text-to-image (T2I) synthesis, image editing, and multi-image composition within a single framework. We develop a highly efficient diffusion transformer with a powerful VAE which also can reduce the number of image tokens considerably. This allows for efficient training of our model, and enables it to fast generate native high-resolution images (e.g., 1K-4K). Seedream 4.0 is pretrained on billions of text-image pairs spanning diverse taxonomies and knowledge-centric concepts. Comprehensive data collection across hundreds of vertical scenarios, coupled with optimized strategies, ensures stable and large-scale training, with strong generalization. By incorporating a carefully fine-tuned VLM model, we perform multi-modal post-training for training both T2I and image editing tasks jointly. For inference acceleration, we integrate adversarial distillation, distribution matching, and quantization, as well as speculative decoding. It achieves an inference time of up to 1.8 seconds for generating a 2K image (without a LLM/VLM as PE model). Comprehensive evaluations reveal that Seedream 4.0 can achieve state-of-the-art results on both T2I and multimodal image editing. In particular, it demonstrates exceptional multimodal capabilities in complex tasks, including precise image editing and in-context reasoning, and also allows for multi-image reference, and can generate multiple output images. This extends traditional T2I systems into an more interactive and multidimensional creative tool, pushing the boundary of generative AI for both creativity and professional applications. Seedream 4.0 is now accessible on https://www.volcengine.com/experience/ark?launch=seedream.
Chinese: Seedream 4.0 是一款高效的多模态图像生成系统,在统一框架内整合了文生图、图像编辑和多图合成功能,通过优化训练和快速生成高分辨率图像实现了顶尖性能。
English: Seedream 4.0 is an advanced multimodal image generation system that unifies text-to-image synthesis, image editing, and multi-image composition in a single framework, achieving state-of-the-art performance through efficient training and fast high-resolution output.

Authors:Keyu Wang, Bingcong Lu, Zhengxue Cheng, Hengdi Zhang, Li Song
Title: D3Grasp: Diverse and Deformable Dexterous Grasping for General Objects
Abstract:
Achieving diverse and stable dexterous grasping for general and deformable objects remains a fundamental challenge in robotics, due to high-dimensional action spaces and uncertainty in perception. In this paper, we present D3Grasp, a multimodal perception-guided reinforcement learning framework designed to enable Diverse and Deformable Dexterous Grasping. We firstly introduce a unified multimodal representation that integrates visual and tactile perception to robustly grasp common objects with diverse properties. Second, we propose an asymmetric reinforcement learning architecture that exploits privileged information during training while preserving deployment realism, enhancing both generalization and sample efficiency. Third, we meticulously design a training strategy to synthesize contact-rich, penetration-free, and kinematically feasible grasps with enhanced adaptability to deformable and contact-sensitive objects. Extensive evaluations confirm that D3Grasp delivers highly robust performance across large-scale and diverse object categories, and substantially advances the state of the art in dexterous grasping for deformable and compliant objects, even under perceptual uncertainty and real-world disturbances. D3Grasp achieves an average success rate of 95.1% in real-world trials,outperforming prior methods on both rigid and deformable objects benchmarks.
中文摘要:D3Grasp提出了一种多模态强化学习框架,通过融合视觉-触觉感知和非对称训练架构,实现了对可变形物体的鲁棒多样化抓取,在真实实验中达到95.1%的成功率。
English Summary: D3Grasp introduces a multimodal reinforcement learning framework that integrates visual-tactile perception and asymmetric training to achieve robust, diverse grasping of deformable objects, demonstrating 95.1% real-world success.

Authors:Bo Yu, Jianhua Yang, Zetao Du, Yan Huang, Chenglong Li, Liang Wang
Title: Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation
Abstract:
Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.
中文:FMISeg模型通过双向频率特征交互模块和语言引导模块,将临床文本报告与频域视觉特征相结合,显著提升了医学图像分割的精度,在多个数据集上验证了其优越性。
English: The FMISeg model improves medical image segmentation by integrating clinical text reports with frequency-domain visual features through bidirectional and language-guided modules, achieving superior performance on benchmark datasets.

Authors:Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li
Title: Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding
Abstract:
Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. However, in practice, many approaches struggle to achieve the expected acceleration in such draft-then-verify paradigm even with a well-aligned early-exit head and selected exit position. Our analysis reveals that EESD only pays off when the vast majority of draft tokens are accepted by the LLM. Otherwise, the draft cost may overcome the acceleration gain and lead to a negative speedup. To mitigate this, we propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work so that no effort is wasted on failed predictions. It has two key innovations. We configure the model layers as a pipeline in which early-exit (draft) computations and remaining-layer (verification) computations overlap. We interleave drafting and verification per token. While the LLM is verifying the current token in its final layers, the early-exit path simultaneously drafts the next token. Such a verify-while-draft scheme keeps all units busy and validates tokens on-the-fly analogous to pipelining the speculation and verification stages. Empirical results confirm that PPSD achieves state-of-the-art acceleration in self-speculative LLM inference. On diverse benchmarks, PPSD achieves speedup ratios in the range of 2.01x~3.81x, which gains almost the optimal acceleration at the fixed acceptance rate and exit position, showcasing its advancement in providing efficient self-speculation.
Chinese: 提出的流水线并行自推测解码(PPSD)方法将草稿和验证计算流水线化,消除了预测失败时的资源浪费,在自推测大语言模型推理中实现了2.01倍至3.81倍的最先进加速效果。
English: The proposed Pipeline-Parallel Self-Speculative Decoding (PPSD) method pipelines draft and verification computations to eliminate wasted effort on failed predictions, achieving state-of-the-art acceleration ratios of 2.01x~3.81x in self-speculative LLM inference.

Authors:Fan Xu, Hao Wu, Nan Wang, Lilan Peng, Kun Wang, Wei Gong, Xibin Zhao
Title: Breaking the Discretization Barrier of Continuous Physics Simulation Learning
Abstract:
The modeling of complicated time-evolving physical dynamics from partial observations is a long-standing challenge. Particularly, observations can be sparsely distributed in a seemingly random or unstructured manner, making it difficult to capture highly nonlinear features in a variety of scientific and engineering problems. However, existing data-driven approaches are often constrained by fixed spatial and temporal discretization. While some researchers attempt to achieve spatio-temporal continuity by designing novel strategies, they either overly rely on traditional numerical methods or fail to truly overcome the limitations imposed by discretization. To address these, we propose CoPS, a purely data-driven methods, to effectively model continuous physics simulation from partial observations. Specifically, we employ multiplicative filter network to fuse and encode spatial information with the corresponding observations. Then we customize geometric grids and use message-passing mechanism to map features from original spatial domain to the customized grids. Subsequently, CoPS models continuous-time dynamics by designing multi-scale graph ODEs, while introducing a Markov-based neural auto-correction module to assist and constrain the continuous extrapolations. Comprehensive experiments demonstrate that CoPS advances the state-of-the-art methods in space-time continuous modeling across various scenarios.
Chinese: CoPS是一种创新的数据驱动方法,通过融合乘性滤波网络、定制几何网格和多尺度图ODE神经网络自校正模块,能够从部分观测数据中有效模拟连续物理过程,在多种场景下实现了最先进的时空连续建模性能。
English: CoPS is a novel data-driven method that effectively models continuous physics simulations from partial observations by employing multiplicative filter networks, customized geometric grids, and multi-scale graph ODEs with a neural auto-correction module, advancing state-of-the-art performance in various scenarios.

Authors:Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro
Title: Attention-based Mixture of Experts for Robust Speech Deepfake Detection
Abstract:
AI-generated speech is becoming increasingly used in everyday life, powering virtual assistants, accessibility tools, and other applications. However, it is also being exploited for malicious purposes such as impersonation, misinformation, and biometric spoofing. As speech deepfakes become nearly indistinguishable from real human speech, the need for robust detection methods and effective countermeasures has become critically urgent. In this paper, we present the ISPL's submission to the SAFE challenge at IH&MMSec 2025, where our system ranked first across all tasks. Our solution introduces a novel approach to audio deepfake detection based on a Mixture of Experts architecture. The proposed system leverages multiple state-of-the-art detectors, combining their outputs through an attention-based gating network that dynamically weights each expert based on the input speech signal. In this design, each expert develops a specialized understanding of the shared training data by learning to capture different complementary aspects of the same input through inductive biases. Experimental results indicate that our method outperforms existing approaches across multiple datasets. We further evaluate and analyze the performance of our system in the SAFE challenge.
中文摘要:人工智能生成语音正被恶意利用于冒充等行为,亟需有效检测手段;本文提出的专家混合架构通过动态加权专业检测器,在多项测试中表现优于现有方法。
English Summary: AI-generated speech is increasingly exploited for malicious purposes like impersonation, necessitating robust detection methods, such as the proposed Mixture of Experts system that outperforms existing approaches by dynamically weighting specialized detectors.

Authors:Houqiang Zhong, Zihan Zheng, Qiang Hu, Yuan Tian, Ning Cao, Lan Xu, Xiaoyun Zhang, Zhengxue Cheng, Li Song, Wenjun Zhang
Title: 4D-MoDe: Towards Editable and Scalable Volumetric Streaming via Motion-Decoupled 4D Gaussian Compression
Abstract:
Volumetric video has emerged as a key medium for immersive telepresence and augmented/virtual reality, enabling six-degrees-of-freedom (6DoF) navigation and realistic spatial interactions. However, delivering high-quality dynamic volumetric content at scale remains challenging due to massive data volume, complex motion, and limited editability of existing representations. In this paper, we present 4D-MoDe, a motion-decoupled 4D Gaussian compression framework designed for scalable and editable volumetric video streaming. Our method introduces a layered representation that explicitly separates static backgrounds from dynamic foregrounds using a lookahead-based motion decomposition strategy, significantly reducing temporal redundancy and enabling selective background/foreground streaming. To capture continuous motion trajectories, we employ a multi-resolution motion estimation grid and a lightweight shared MLP, complemented by a dynamic Gaussian compensation mechanism to model emergent content. An adaptive grouping scheme dynamically inserts background keyframes to balance temporal consistency and compression efficiency. Furthermore, an entropy-aware training pipeline jointly optimizes the motion fields and Gaussian parameters under a rate-distortion (RD) objective, while employing range-based and KD-tree compression to minimize storage overhead. Extensive experiments on multiple datasets demonstrate that 4D-MoDe consistently achieves competitive reconstruction quality with an order of magnitude lower storage cost (e.g., as low as \textbf{11.4} KB/frame) compared to state-of-the-art methods, while supporting practical applications such as background replacement and foreground-only streaming.
中文摘要:4D-MoDe提出运动解耦的四维高斯压缩框架,通过分层表示分离静态背景与动态前景,在显著降低存储成本的同时保持高质量重建,并支持背景替换等实用功能。
English Summary: 4D-MoDe is a motion-decoupled compression framework that enables scalable and editable volumetric video streaming by separating static and dynamic elements, achieving high-quality reconstruction with drastically reduced storage costs.

Authors:Chi Zhang, Mengxin Zheng, Qian Lou, Fan Chen
Title: DiffQ: Unified Parameter Initialization for Variational Quantum Algorithms via Diffusion Models
Abstract:
Variational Quantum Algorithms (VQAs) are widely used in the noisy intermediate-scale quantum (NISQ) era, but their trainability and performance depend critically on initialization parameters that shape the optimization landscape. Existing machine learning-based initializers achieve state-of-the-art results yet remain constrained to single-task domains and small datasets of only hundreds of samples. We address these limitations by reformulating VQA parameter initialization as a generative modeling problem and introducing DiffQ, a parameter initializer based on the Denoising Diffusion Probabilistic Model (DDPM). To support robust training and evaluation, we construct a dataset of 15,085 instances spanning three domains and five representative tasks. Experiments demonstrate that DiffQ surpasses baselines, reducing initial loss by up to 8.95 and convergence steps by up to 23.4%.
中文:DiffQ采用生成式建模方法,通过去噪扩散模型改进变分量子算法的参数初始化,在跨多任务和大规模数据集上显著降低初始损失和收敛步数,优于现有方法。
English: DiffQ introduces a generative modeling approach using denoising diffusion to enhance variational quantum algorithm initialization, outperforming existing methods by significantly reducing initial loss and convergence steps across multiple tasks and a large dataset.

Authors:Chi Zhang, Mengxin Zheng, Qian Lou, Hui Min Leung, Fan Chen
Title: VQEzy: An Open-Source Dataset for Parameter Initialize in Variational Quantum Eigensolvers
Abstract:
Variational Quantum Eigensolvers (VQEs) are a leading class of noisy intermediate-scale quantum (NISQ) algorithms, whose performance is highly sensitive to parameter initialization. Although recent machine learning-based initialization methods have achieved state-of-the-art performance, their progress has been limited by the lack of comprehensive datasets. Existing resources are typically restricted to a single domain, contain only a few hundred instances, and lack complete coverage of Hamiltonians, ansatz circuits, and optimization trajectories. To overcome these limitations, we introduce VQEzy, the first large-scale dataset for VQE parameter initialization. VQEzy spans three major domains and seven representative tasks, comprising 12,110 instances with full VQE specifications and complete optimization trajectories. The dataset is available online, and will be continuously refined and expanded to support future research in VQE optimization.
中文摘要:VQEzy作为首个大规模数据集被推出,旨在解决现有变分量子本征求解器参数初始化资源的局限性,该数据集横跨多个领域,包含超过12,000个完整配置实例和优化轨迹。
English Summary: VQEzy is introduced as the first large-scale dataset to overcome the limitations of existing resources for Variational Quantum Eigensolver parameter initialization, providing comprehensive coverage across multiple domains with over 12,000 fully specified instances and complete optimization trajectories.

Authors:Chi Zhang, Mengxin Zheng, Qian Lou, Hui Min Leung, Fan Chen
Title: VQEzy: An Open-Source Dataset for Parameter Initialization in Variational Quantum Eigensolvers
Abstract:
Variational Quantum Eigensolvers (VQEs) are a leading class of noisy intermediate-scale quantum (NISQ) algorithms, whose performance is highly sensitive to parameter initialization. Although recent machine learning-based initialization methods have achieved state-of-the-art performance, their progress has been limited by the lack of comprehensive datasets. Existing resources are typically restricted to a single domain, contain only a few hundred instances, and lack complete coverage of Hamiltonians, ansatz circuits, and optimization trajectories. To overcome these limitations, we introduce VQEzy, the first large-scale dataset for VQE parameter initialization. VQEzy spans three major domains and seven representative tasks, comprising 12,110 instances with full VQE specifications and complete optimization trajectories. The dataset is available online, and will be continuously refined and expanded to support future research in VQE optimization.
中文摘要:VQEzy作为首个大规模数据集被推出,旨在解决现有变分量子本征求解器参数初始化资源的局限性,该数据集横跨多个领域,包含超过12,000个完整配置实例和优化轨迹。
English Summary: VQEzy is introduced as the first large-scale dataset to overcome the limitations of existing resources for Variational Quantum Eigensolver parameter initialization, providing comprehensive coverage across multiple domains with over 12,000 fully specified instances and complete optimization trajectories.

Authors:Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu
Title: Probabilistic Token Alignment for Large Language Model Fusion
Abstract:
Training large language models (LLMs) from scratch can yield models with unique functionalities and strengths, but it is costly and often leads to redundant capabilities. A more cost-effective alternative is to fuse existing pre-trained LLMs with different architectures into a more powerful model. However, a key challenge in existing model fusion is their dependence on manually predefined vocabulary alignment, which may not generalize well across diverse contexts, leading to performance degradation in several evaluation. To solve this, we draw inspiration from distribution learning and propose the probabilistic token alignment method as a general and soft mapping for alignment, named as PTA-LLM. Our approach innovatively reformulates token alignment into a classic mathematical problem: optimal transport, seamlessly leveraging distribution-aware learning to facilitate more coherent model fusion. Apart from its inherent generality, PTA-LLM exhibits interpretability from a distributional perspective, offering insights into the essence of the token alignment. Empirical results demonstrate that probabilistic token alignment enhances the target model's performance across multiple capabilities. Our code is avaliable at https://runjia.tech/neurips_pta-llm/.
中文摘要:PTA-LLM提出了一种基于最优传输的概率令牌对齐方法,无需手动词汇对齐即可实现预训练语言模型的更连贯融合,并在多项能力上提升性能。
English Summary: PTA-LLM introduces a probabilistic token alignment method using optimal transport to enable more coherent fusion of pre-trained language models without manual vocabulary alignment, improving performance across multiple capabilities.

Authors:Dehao Zhang, Malu Zhang, Shuai Wang, Jingya Wang, Wenjie Wei, Zeyu Ma, Guoqing Wang, Yang Yang, HaiZhou Li
Title: Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling
Abstract:
The explosive growth in sequence length has intensified the demand for effective and efficient long sequence modeling. Benefiting from intrinsic oscillatory membrane dynamics, Resonate-and-Fire (RF) neurons can efficiently extract frequency components from input signals and encode them into spatiotemporal spike trains, making them well-suited for long sequence modeling. However, RF neurons exhibit limited effective memory capacity and a trade-off between energy efficiency and training speed on complex temporal tasks. Inspired by the dendritic structure of biological neurons, we propose a Dendritic Resonate-and-Fire (D-RF) model, which explicitly incorporates a multi-dendritic and soma architecture. Each dendritic branch encodes specific frequency bands by utilizing the intrinsic oscillatory dynamics of RF neurons, thereby collectively achieving comprehensive frequency representation. Furthermore, we introduce an adaptive threshold mechanism into the soma structure that adjusts the threshold based on historical spiking activity, reducing redundant spikes while maintaining training efficiency in long sequence tasks. Extensive experiments demonstrate that our method maintains competitive accuracy while substantially ensuring sparse spikes without compromising computational efficiency during training. These results underscore its potential as an effective and efficient solution for long sequence modeling on edge platforms.
中文摘要:树突谐振发放模型通过多树突频率编码和自适应阈值机制,在保持计算效率的同时以稀疏脉冲实现竞争性精度,为长序列建模提供了高效解决方案。
English Summary: The Dendritic Resonate-and-Fire model enhances long sequence modeling by incorporating multi-dendritic encoding of frequency bands and an adaptive threshold mechanism, achieving competitive accuracy with sparse spikes while maintaining computational efficiency.

Authors:Wataru Nakata, Yuki Saito, Yota Ueda, Hiroshi Saruwatari
Title: Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing
Abstract:
Large-scale text-to-speech (TTS) systems are limited by the scarcity of clean, multilingual recordings. We introduce Sidon, a fast, open-source speech restoration model that converts noisy in-the-wild speech into studio-quality speech and scales to dozens of languages. Sidon consists of two models: w2v-BERT 2.0 finetuned feature predictor to cleanse features from noisy speech and vocoder trained to synthesize restored speech from the cleansed features. Sidon achieves restoration performance comparable to Miipher: Google's internal speech restoration model with the aim of dataset cleansing for speech synthesis. Sidon is also computationally efficient, running up to 3,390 times faster than real time on a single GPU. We further show that training a TTS model using a Sidon-cleansed automatic speech recognition corpus improves the quality of synthetic speech in a zero-shot setting. Code and model are released to facilitate reproducible dataset cleansing for the research community.
中文摘要:Sidon是一款快速开源语音修复模型,能将嘈杂的真实环境语音转换为多语言的高质量录音,从而提升文本转语音系统的训练效果,其处理速度比实时快500倍。
English Summary: Sidon is a fast, open-source speech restoration model that cleans noisy speech into studio-quality audio across multiple languages, enabling improved text-to-speech training and running 500 times faster than real time.

Authors:Wataru Nakata, Yuki Saito, Yota Ueda, Hiroshi Saruwatari
Title: Sidon: Fast and Robust Open-Source Multilingual Speech Restoration for Large-scale Dataset Cleansing
Abstract:
Large-scale text-to-speech (TTS) systems are limited by the scarcity of clean, multilingual recordings. We introduce Sidon, a fast, open-source speech restoration model that converts noisy in-the-wild speech into studio-quality speech and scales to dozens of languages. Sidon consists of two models: w2v-BERT 2.0 finetuned feature predictor to cleanse features from noisy speech and vocoder trained to synthesize restored speech from the cleansed features. Sidon achieves restoration performance comparable to Miipher: Google's internal speech restoration model with the aim of dataset cleansing for speech synthesis. Sidon is also computationally efficient, running up to 500 times faster than real time on a single GPU. We further show that training a TTS model using a Sidon-cleansed automatic speech recognition corpus improves the quality of synthetic speech in a zero-shot setting. Code and model are released to facilitate reproducible dataset cleansing for the research community.
中文摘要:Sidon是一款快速开源语音修复模型,能将嘈杂的真实环境语音转换为多语言的高质量录音,从而提升文本转语音系统的训练效果,其处理速度比实时快500倍。
English Summary: Sidon is a fast, open-source speech restoration model that cleans noisy speech into studio-quality audio across multiple languages, enabling improved text-to-speech training and running 500 times faster than real time.

Authors:Ruiyan Wang, Zhengxue Cheng, Zonghao Lin, Jun Ling, Yuzhou Liu, Yanru An, Rong Xie, Li Song
Title: SemanticGarment: Semantic-Controlled Generation and Editing of 3D Gaussian Garments
Abstract:
3D digital garment generation and editing play a pivotal role in fashion design, virtual try-on, and gaming. Traditional methods struggle to meet the growing demand due to technical complexity and high resource costs. Learning-based approaches offer faster, more diverse garment synthesis based on specific requirements and reduce human efforts and time costs. However, they still face challenges such as inconsistent multi-view geometry or textures and heavy reliance on detailed garment topology and manual rigging. We propose SemanticGarment, a 3D Gaussian-based method that realizes high-fidelity 3D garment generation from text or image prompts and supports semantic-based interactive editing for flexible user customization. To ensure multi-view consistency and garment fitting, we propose to leverage structural human priors for the generative model by introducing a 3D semantic clothing model, which initializes the geometry structure and lays the groundwork for view-consistent garment generation and editing. Without the need to regenerate or rely on existing mesh templates, our approach allows for rapid and diverse modifications to existing Gaussians, either globally or within a local region. To address the artifacts caused by self-occlusion for garment reconstruction based on single image, we develop a self-occlusion optimization strategy to mitigate holes and artifacts that arise when directly animating self-occluded garments. Extensive experiments are conducted to demonstrate our superior performance in 3D garment generation and editing.
中文: 提出的SemanticGarment方法采用3D高斯技术和人体结构先验,实现了从文本或图像生成高保真3D服装并支持基于语义的编辑,有效解决了多视角不一致和自遮挡伪影等挑战。
English: The proposed SemanticGarment method utilizes 3D Gaussian techniques and structural human priors to enable high-fidelity 3D garment generation from text or image inputs while supporting semantic-based editing, effectively addressing challenges like multi-view inconsistency and self-occlusion artifacts.

Authors:Mohammad Beigi, Ying Shen, Parshin Shojaee, Qifan Wang, Zichao Wang, Chandan Reddy, Ming Jin, Lifu Huang
Title: Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories
Abstract:
Despite the remarkable capabilities of large language models, current training paradigms inadvertently foster \textit{sycophancy}, i.e., the tendency of a model to agree with or reinforce user-provided information even when it's factually incorrect. To address this challenge, we introduce \textbf{SMART} (Sycophancy Mitigation through Adaptive Reasoning Trajectories), which reframes sycophancy as a \textit{reasoning optimization problem} rather than an output alignment issue. SMART is a two-stage framework comprising: (1) Uncertainty-Aware Adaptive Monte Carlo Tree Search (UA-MCTS), which dynamically adjusts model exploration based on state-level uncertainty to collect high-quality, diverse reasoning trajectories alongside both stepwise progress and final outcome rewards; and (2) progress-based reinforcement learning, which fine-tunes the model using the collected trajectories and reward signals to reinforce effective reasoning patterns. Through extensive experiments, we show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs and maintaining general capabilities. These results underscore the importance of optimizing internal reasoning mechanisms to build more truthful and aligned AI assistants.
中文摘要:SMART框架通过将模型奉承问题重构为推理优化挑战,采用不确定性感知探索和基于进度的强化学习双阶段方法,显著减少了语言模型的盲从行为,同时保持了其泛化能力和核心性能。
English Summary: The SMART framework effectively mitigates sycophantic behavior in large language models by treating it as a reasoning optimization problem, combining uncertainty-aware exploration and progress-based reinforcement learning to enhance truthfulness without compromising overall performance.

Authors:Zhenlan Ji, Daoyuan Wu, Wenxuan Wang, Pingchuan Ma, Shuai Wang, Lei Ma
Title: Digging Into the Internal: Causality-Based Analysis of LLM Function Calling
Abstract:
Function calling (FC) has emerged as a powerful technique for facilitating large language models (LLMs) to interact with external systems and perform structured tasks. However, the mechanisms through which it influences model behavior remain largely under-explored. Besides, we discover that in addition to the regular usage of FC, this technique can substantially enhance the compliance of LLMs with user instructions. These observations motivate us to leverage causality, a canonical analysis method, to investigate how FC works within LLMs. In particular, we conduct layer-level and token-level causal interventions to dissect FC's impact on the model's internal computational logic when responding to user queries. Our analysis confirms the substantial influence of FC and reveals several in-depth insights into its mechanisms. To further validate our findings, we conduct extensive experiments comparing the effectiveness of FC-based instructions against conventional prompting methods. We focus on enhancing LLM safety robustness, a critical LLM application scenario, and evaluate four mainstream LLMs across two benchmark datasets. The results are striking: FC shows an average performance improvement of around 135% over conventional prompting methods in detecting malicious inputs, demonstrating its promising potential to enhance LLM reliability and capability in practical applications.
中文摘要:函数调用技术大幅提升了大语言模型对用户指令的遵从性和安全鲁棒性,在恶意输入检测任务中相比传统提示方法实现了约135%的性能提升。
English Summary: Function calling significantly enhances large language models' compliance with user instructions and safety robustness, achieving a 135% performance improvement over conventional methods in detecting malicious inputs.

Authors:Renjie Pi, Kehao Miao, Li Peihang, Runtao Liu, Jiahui Gao, Jipeng Zhang, Xiaofang Zhou
Title: Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models
Abstract:
Multimodal large language models (MLLMs) have demonstrated extraordinary capabilities in conducting conversations based on image inputs. However, we observe that MLLMs exhibit a pronounced form of visual sycophantic behavior. While similar behavior has also been noted in text-based large language models (LLMs), it becomes significantly more prominent when MLLMs process image inputs. We refer to this phenomenon as the "sycophantic modality gap." To better understand this issue, we further analyze the factors that contribute to the exacerbation of this gap. To mitigate the visual sycophantic behavior, we first experiment with naive supervised fine-tuning to help the MLLM resist misleading instructions from the user. However, we find that this approach also makes the MLLM overly resistant to corrective instructions (i.e., stubborn even if it is wrong). To alleviate this trade-off, we propose Sycophantic Reflective Tuning (SRT), which enables the MLLM to engage in reflective reasoning, allowing it to determine whether a user's instruction is misleading or corrective before drawing a conclusion. After applying SRT, we observe a significant reduction in sycophantic behavior toward misleading instructions, without resulting in excessive stubbornness when receiving corrective instructions.
中文: 多模态大语言模型表现出明显的视觉盲从行为,而提出的反思性调优方法有效减少了其对误导性指令的盲从,同时避免了对纠正性指令的过度固执。
English: Multimodal large language models exhibit a pronounced visual sycophantic behavior, which is mitigated by the proposed Sycophantic Reflective Tuning method that reduces compliance with misleading instructions without causing excessive stubbornness to corrective ones.

Authors:Sikai Bai, Haoxi Li, Jie Zhang, Zicong Hong, Song Guo
Title: DiEP: Adaptive Mixture-of-Experts Compression through Differentiable Expert Pruning
Abstract:
Despite the significant breakthrough of Mixture-of-Experts (MoE), the increasing scale of these MoE models presents huge memory and storage challenges. Existing MoE pruning methods, which involve reducing parameter size with a uniform sparsity across all layers, often lead to suboptimal outcomes and performance degradation due to varying expert redundancy in different MoE layers. To address this, we propose a non-uniform pruning strategy, dubbed \textbf{Di}fferentiable \textbf{E}xpert \textbf{P}runing (\textbf{DiEP}), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, effectively capturing the varying redundancy across different MoE layers. By transforming the global discrete search space into a continuous one, our method handles exponentially growing non-uniform expert combinations, enabling adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, \textbf{DiEP} retains around 92\% of original performance on Mixtral 8$\times$7B with only half the experts, outperforming other pruning methods by up to 7.1\% on the challenging MMLU dataset.
中文: 提出的DiEP方法通过非均匀分层剪枝策略解决MoE模型效率问题,在显著减少参数的同时保持约92%的原始性能,在MMLU数据集上比其他剪枝方法性能提升达7.1%。
English: The proposed DiEP method addresses MoE model inefficiency by implementing non-uniform, layer-adaptive pruning that preserves performance while significantly reducing parameters, outperforming existing approaches by up to 7.1% on benchmarks.

Authors:Chang Yu, Siyu Ma, Wenxin Du, Zeshun Zong, Han Xue, Wendi Chen, Cewu Lu, Yin Yang, Xuchen Han, Joseph Masterjohn, Alejandro Castro, Chenfanfu Jiang
Title: Right-Side-Out: Learning Zero-Shot Sim-to-Real Garment Reversal
Abstract:
Turning garments right-side out is a challenging manipulation task: it is highly dynamic, entails rapid contact changes, and is subject to severe visual occlusion. We introduce Right-Side-Out, a zero-shot sim-to-real framework that effectively solves this challenge by exploiting task structures. We decompose the task into Drag/Fling to create and stabilize an access opening, followed by Insert&Pull to invert the garment. Each step uses a depth-inferred, keypoint-parameterized bimanual primitive that sharply reduces the action space while preserving robustness. Efficient data generation is enabled by our custom-built, high-fidelity, GPU-parallel Material Point Method (MPM) simulator that models thin-shell deformation and provides robust and efficient contact handling for batched rollouts. Built on the simulator, our fully automated pipeline scales data generation by randomizing garment geometry, material parameters, and viewpoints, producing depth, masks, and per-primitive keypoint labels without any human annotations. With a single depth camera, policies trained entirely in simulation deploy zero-shot on real hardware, achieving up to 81.3% success rate. By employing task decomposition and high fidelity simulation, our framework enables tackling highly dynamic, severely occluded tasks without laborious human demonstrations.
中文摘要:Right-Side-Out框架通过任务分解和高精度模拟,解决了衣物翻面的动态挑战,无需人工演示即可实现81.3%的真实世界成功率。
English Summary: The Right-Side-Out framework solves the challenging task of turning garments right-side out through task decomposition and high-fidelity simulation, achieving 81.3% real-world success without human demonstrations.

Authors:Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Hua Gang
Title: SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models
Abstract:
World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.
中文: SAMPO是一种混合世界模型,通过结合自回归与因果建模、高效解码和运动提示来提升视觉预测能力,在视频预测和控制任务中表现出色且推理速度更快。
English: SAMPO is a hybrid world model that enhances visual prediction by combining autoregressive and causal modeling with efficient decoding and motion prompts, achieving superior performance in video prediction and control tasks with faster inference.

Authors:Tianyang Duan, Zongyuan Zhang, Songxiao Guo, Yuanye Zhao, Zheng Lin, Zihan Fang, Yi Liu, Dianxin Luan, Dong Huang, Heming Cui, Yong Cui
Title: Sample Efficient Experience Replay in Non-stationary Environments
Abstract:
Reinforcement learning (RL) in non-stationary environments is challenging, as changing dynamics and rewards quickly make past experiences outdated. Traditional experience replay (ER) methods, especially those using TD-error prioritization, struggle to distinguish between changes caused by the agent's policy and those from the environment, resulting in inefficient learning under dynamic conditions. To address this challenge, we propose the Discrepancy of Environment Dynamics (DoE), a metric that isolates the effects of environment shifts on value functions. Building on this, we introduce Discrepancy of Environment Prioritized Experience Replay (DEER), an adaptive ER framework that prioritizes transitions based on both policy updates and environmental changes. DEER uses a binary classifier to detect environment changes and applies distinct prioritization strategies before and after each shift, enabling more sample-efficient learning. Experiments on four non-stationary benchmarks demonstrate that DEER further improves the performance of off-policy algorithms by 11.54 percent compared to the best-performing state-of-the-art ER methods.
Chinese: 本文提出了环境动态差异(DoE)指标和DEER框架,通过二元分类器根据策略更新和环境变化对转移进行优先级排序,在非平稳环境中将离线策略算法的性能提升了11.54%。
English: This paper introduces the Discrepancy of Environment Dynamics (DoE) metric and the DEER framework, which uses a binary classifier to prioritize transitions based on policy updates and environmental changes, improving off-policy algorithm performance by 11.54% in non-stationary environments.

Authors:Tianyang Duan, Zongyuan Zhang, Songxiao Guo, Dong Huang, Yuanye Zhao, Zheng Lin, Zihan Fang, Dianxin Luan, Heming Cui, Yong Cui
Title: LEED: A Highly Efficient and Scalable LLM-Empowered Expert Demonstrations Framework for Multi-Agent Reinforcement Learning
Abstract:
Multi-agent reinforcement learning (MARL) holds substantial promise for intelligent decision-making in complex environments. However, it suffers from a coordination and scalability bottleneck as the number of agents increases. To address these issues, we propose the LLM-empowered expert demonstrations framework for multi-agent reinforcement learning (LEED). LEED consists of two components: a demonstration generation (DG) module and a policy optimization (PO) module. Specifically, the DG module leverages large language models to generate instructions for interacting with the environment, thereby producing high-quality demonstrations. The PO module adopts a decentralized training paradigm, where each agent utilizes the generated demonstrations to construct an expert policy loss, which is then integrated with its own policy loss. This enables each agent to effectively personalize and optimize its local policy based on both expert knowledge and individual experience. Experimental results show that LEED achieves superior sample efficiency, time efficiency, and robust scalability compared to state-of-the-art baselines.
中文:提出的LEED框架通过利用大语言模型生成专家示范,使智能体能够结合自身经验优化策略,从而在多智能体强化学习中实现了更高的效率和可扩展性。
English: The proposed LEED framework enhances multi-agent reinforcement learning by using large language models to generate expert demonstrations, which agents then integrate with their own experiences to optimize policies, achieving superior efficiency and scalability in experiments.

Authors:Ming Li, Nan Zhang, Chenrui Fan, Hong Jiao, Yanbin Fu, Sydney Peters, Qingshu Xu, Robert Lissitz, Tianyi Zhou
Title: Understanding the Thinking Process of Reasoning Models: A Perspective from Schoenfeld's Episode Theory
Abstract:
While Large Reasoning Models (LRMs) generate extensive chain-of-thought reasoning, we lack a principled framework for understanding how these thoughts are structured. In this paper, we introduce a novel approach by applying Schoenfeld's Episode Theory, a classic cognitive framework for human mathematical problem-solving, to analyze the reasoning traces of LRMs. We annotated thousands of sentences and paragraphs from model-generated solutions to math problems using seven cognitive labels (e.g., Plan, Implement, Verify). The result is the first publicly available benchmark for the fine-grained analysis of machine reasoning, including a large annotated corpus and detailed annotation guidebooks. Our preliminary analysis reveals distinct patterns in LRM reasoning, such as the transition dynamics between cognitive states. This framework provides a theoretically grounded methodology for interpreting LRM cognition and enables future work on more controllable and transparent reasoning systems.
中文: 本文创新性地应用舍恩菲尔德的问题解决理论分析大型推理模型的思维结构,建立了首个细粒度机器推理分析基准,揭示了独特的认知模式,为构建更可控透明的推理系统提供了理论基础。
English: This paper introduces a novel application of Schoenfeld's Episode Theory to analyze the reasoning structures of Large Reasoning Models, creating the first publicly available benchmark for fine-grained analysis that reveals distinct cognitive patterns and enables more interpretable AI systems.

Authors:Kairong Ma, Yao Sun, Shuheng Hua, Muhammad Ali Imran, Walid Saad
Title: A Unified Learning-based Optimization Framework for 0-1 Mixed Problems in Wireless Networks
Abstract:
Several wireless networking problems are often posed as 0-1 mixed optimization problems, which involve binary variables (e.g., selection of access points, channels, and tasks) and continuous variables (e.g., allocation of bandwidth, power, and computing resources). Traditional optimization methods as well as reinforcement learning (RL) algorithms have been widely exploited to solve these problems under different network scenarios. However, solving such problems becomes more challenging when dealing with a large network scale, multi-dimensional radio resources, and diversified service requirements. To this end, in this paper, a unified framework that combines RL and optimization theory is proposed to solve 0-1 mixed optimization problems in wireless networks. First, RL is used to capture the process of solving binary variables as a sequential decision-making task. During the decision-making steps, the binary (0-1) variables are relaxed and, then, a relaxed problem is solved to obtain a relaxed solution, which serves as prior information to guide RL searching policy. Then, at the end of decision-making process, the search policy is updated via suboptimal objective value based on decisions made. The performance bound and convergence guarantees of the proposed framework are then proven theoretically. An extension of this approach is provided to solve problems with a non-convex objective function and/or non-convex constraints. Numerical results show that the proposed approach reduces the convergence time by about 30% over B&B in small-scale problems with slightly higher objective values. In large-scale scenarios, it can improve the normalized objective values by 20% over RL with a shorter convergence time.
中文: 本文提出了一种结合强化学习和优化理论的统一框架,用于高效解决无线网络中的0-1混合优化问题,在小型和大型场景中均实现了更快的收敛速度和更好的性能表现。
English: This paper introduces a unified framework combining reinforcement learning and optimization theory to efficiently solve 0-1 mixed optimization problems in wireless networks, achieving faster convergence and improved performance in both small-scale and large-scale scenarios.

Authors:Kairong Ma, Yao Sun, Shuheng Hua, Muhammad Ali Imran, Walid Saad
Title: A Unified Learning-based Optimization Framework for 0-1 Mixed Problems in Wireless Networks
Abstract:
Several wireless networking problems are often posed as 0-1 mixed optimization problems, which involve binary variables (e.g., selection of access points, channels, and tasks) and continuous variables (e.g., allocation of bandwidth, power, and computing resources). Traditional optimization methods as well as reinforcement learning (RL) algorithms have been widely exploited to solve these problems under different network scenarios. However, solving such problems becomes more challenging when dealing with a large network scale, multi-dimensional radio resources, and diversified service requirements. To this end, in this paper, a unified framework that combines RL and optimization theory is proposed to solve 0-1 mixed optimization problems in wireless networks. First, RL is used to capture the process of solving binary variables as a sequential decision-making task. During the decision-making steps, the binary (0-1) variables are relaxed and, then, a relaxed problem is solved to obtain a relaxed solution, which serves as prior information to guide RL searching policy. Then, at the end of decision-making process, the search policy is updated via suboptimal objective value based on decisions made. The performance bound and convergence guarantees of the proposed framework are then proven theoretically. An extension of this approach is provided to solve problems with a non-convex objective function and/or non-convex constraints. Numerical results show that the proposed approach reduces the convergence time by about 30% over B&B in small-scale problems with slightly higher objective values. In large-scale scenarios, it can improve the normalized objective values by 20% over RL with a shorter convergence time.
中文: 本文提出了一种结合强化学习和优化理论的统一框架,用于高效解决无线网络中的0-1混合优化问题,在小型和大型场景中均实现了更快的收敛速度和更好的性能表现。
English: This paper introduces a unified framework combining reinforcement learning and optimization theory to efficiently solve 0-1 mixed optimization problems in wireless networks, achieving faster convergence and improved performance in both small-scale and large-scale scenarios.

Authors:Ruizhong Qiu, Ting-Wei Li, Gaotang Li, Hanghang Tong
Title: Graph Homophily Booster: Rethinking the Role of Discrete Features on Heterophilic Graphs
Abstract:
Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph-structured data. However, existing GNNs often struggle with heterophilic graphs, where connected nodes tend to have dissimilar features or labels. While numerous methods have been proposed to address this challenge, they primarily focus on architectural designs without directly targeting the root cause of the heterophily problem. These approaches still perform even worse than the simplest MLPs on challenging heterophilic datasets. For instance, our experiments show that 21 latest GNNs still fall behind the MLP on the Actor dataset. This critical challenge calls for an innovative approach to addressing graph heterophily beyond architectural designs. To bridge this gap, we propose and study a new and unexplored paradigm: directly increasing the graph homophily via a carefully designed graph transformation. In this work, we present a simple yet effective framework called GRAPHITE to address graph heterophily. To the best of our knowledge, this work is the first method that explicitly transforms the graph to directly improve the graph homophily. Stemmed from the exact definition of homophily, our proposed GRAPHITE creates feature nodes to facilitate homophilic message passing between nodes that share similar features. Furthermore, we both theoretically and empirically show that our proposed GRAPHITE significantly increases the homophily of originally heterophilic graphs, with only a slight increase in the graph size. Extensive experiments on challenging datasets demonstrate that our proposed GRAPHITE significantly outperforms state-of-the-art methods on heterophilic graphs while achieving comparable accuracy with state-of-the-art methods on homophilic graphs.
中文: 图神经网络在异配图数据上表现不佳,而提出的GRAPHITE框架通过创建特征节点来增强图同配性,在保持图规模小幅增长的同时,显著提升了异配图上的性能表现。
English: Graph neural networks often underperform on heterophilic graphs, but the proposed GRAPHITE framework innovatively addresses this by transforming graphs to increase homophily through feature nodes, achieving superior results on challenging datasets.

Authors:Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Title: Selective Risk Certification for LLM Outputs via Information-Lift Statistics: PAC-Bayes, Robustness, and Skeleton Design
Abstract:
Large language models frequently generate confident but incorrect outputs, requiring formal uncertainty quantification with abstention guarantees. We develop information-lift certificates that compare model probabilities to a skeleton baseline, accumulating evidence into sub-gamma PAC-Bayes bounds valid under heavy-tailed distributions. Across eight datasets, our method achieves 77.2\% coverage at 2\% risk, outperforming recent 2023-2024 baselines by 8.6-15.1 percentage points, while blocking 96\% of critical errors in high-stakes scenarios vs 18-31\% for entropy methods. Limitations include skeleton dependence and frequency-only (not severity-aware) risk control, though performance degrades gracefully under corruption.
Chinese: 我们开发的信息提升证书通过PAC-Bayes边界量化大语言模型的不确定性,在八个数据集上以2%风险实现77.2%覆盖率,同时阻止96%的关键错误,性能较2023-2024年基线方法提升8.6-15.1个百分点。
English: Our method develops information-lift certificates using PAC-Bayes bounds to quantify uncertainty in large language models, achieving 77.2% coverage at 2% risk across eight datasets while blocking 96% of critical errors and outperforming recent baselines by 8.6-15.1 percentage points.

Authors:Jinghua Zhao, Hang Su, Lichun Fan, Zhenbo Luo, Hui Wang, Haoqin Sun, Yong Qin
Title: Omni-CLST: Error-aware Curriculum Learning with guided Selective chain-of-Thought for audio question answering
Abstract:
With the rapid progress of large audio-language models (LALMs), audio question answering (AQA) has emerged as a challenging task requiring both fine-grained audio understanding and complex reasoning. While current methods mainly rely on constructing new datasets via captioning or reasoning traces, existing high-quality AQA data remains underutilized. To address this, we propose Omni-CLST, an error-aware Curriculum Learning framework with guided Selective Chain-of-Thought. The framework efficiently leverages existing high-quality dataset through two key strategies: an error-aware curriculum that organizes samples by difficulty, and a guided thought dropout mechanism that focuses reasoning on challenging cases. Experiments show that Omni-CLST achieves 73.80% on MMAU-mini and a new state of the art of 64.30% on MMAR, demonstrating robust generalization in multimodal audio-language understanding.
Chinese: 提出的Omni-CLST框架通过误差感知课程学习和引导式选择性思维链,有效利用现有高质量数据,在基准测试中实现了最优性能,显著提升了音频问答任务的表现。
English: The proposed Omni-CLST framework enhances audio question answering by implementing error-aware curriculum learning and guided selective chain-of-thought, achieving state-of-the-art performance on benchmark datasets through optimized utilization of existing high-quality data.

Authors:Xinyu He, Chenhan Xiao, Haoran Li, Ruizhong Qiu, Zhe Xu, Yang Weng, Jingrui He, Hanghang Tong
Title: PowerGrow: Feasible Co-Growth of Structures and Dynamics for Power Grid Synthesis
Abstract:
Modern power systems are becoming increasingly dynamic, with changing topologies and time-varying loads driven by renewable energy variability, electric vehicle adoption, and active grid reconfiguration. Despite these changes, publicly available test cases remain scarce, due to security concerns and the significant effort required to anonymize real systems. Such limitations call for generative tools that can jointly synthesize grid structure and nodal dynamics. However, modeling the joint distribution of network topology, branch attributes, bus properties, and dynamic load profiles remains a major challenge, while preserving physical feasibility and avoiding prohibitive computational costs. We present PowerGrow, a co-generative framework that significantly reduces computational overhead while maintaining operational validity. The core idea is dependence decomposition: the complex joint distribution is factorized into a chain of conditional distributions over feasible grid topologies, time-series bus loads, and other system attributes, leveraging their mutual dependencies. By constraining the generation process at each stage, we implement a hierarchical graph beta-diffusion process for structural synthesis, paired with a temporal autoencoder that embeds time-series data into a compact latent space, improving both training stability and sample fidelity. Experiments across benchmark settings show that PowerGrow not only outperforms prior diffusion models in fidelity and diversity but also achieves a 98.9\% power flow convergence rate and improved N-1 contingency resilience. This demonstrates its ability to generate operationally valid and realistic power grid scenarios.
中文摘要:PowerGrow通过依赖分解和分层扩散过程,高效协同生成电网拓扑与动态负荷,实现了98.9%的潮流收敛率,能产生具备高运行有效性的逼真电网场景。
English Summary: PowerGrow is a co-generative framework that efficiently synthesizes realistic power grid structures and dynamic load profiles through dependence decomposition and hierarchical diffusion processes, achieving high operational validity with 98.9% power flow convergence.

Authors:Wei Cai, Shujuan Liu, Jian Zhao, Ziyan Shi, Yusheng Zhao, Yuchen Yuan, Tianle Zhang, Chi Zhang, Xuelong Li
Title: When Safe Unimodal Inputs Collide: Optimizing Reasoning Chains for Cross-Modal Safety in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) are susceptible to the implicit reasoning risk, wherein innocuous unimodal inputs synergistically assemble into risky multimodal data that produce harmful outputs. We attribute this vulnerability to the difficulty of MLLMs maintaining safety alignment through long-chain reasoning. To address this issue, we introduce Safe-Semantics-but-Unsafe-Interpretation (SSUI), the first dataset featuring interpretable reasoning paths tailored for such a cross-modal challenge. A novel training framework, Safety-aware Reasoning Path Optimization (SRPO), is also designed based on the SSUI dataset to align the MLLM's internal reasoning process with human safety values. Experimental results show that our SRPO-trained models achieve state-of-the-art results on key safety benchmarks, including the proposed Reasoning Path Benchmark (RSBench), significantly outperforming both open-source and top-tier commercial MLLMs.
中文: 多模态大语言模型存在隐含推理风险,即无害的单模态输入会协同形成危险的多模态数据并产生有害输出,而新提出的安全感知推理路径优化框架通过专门数据集使模型的内部推理过程与人类安全价值观对齐来解决这一问题。
English: Multimodal Large Language Models face implicit reasoning risks where safe individual inputs combine to create harmful outputs, which the proposed Safety-aware Reasoning Path Optimization framework addresses by aligning reasoning processes with human safety values using a specialized dataset.

Authors:Weishu Chen, Jinyi Tang, Zhouhui Hou, Shihao Han, Mingjie Zhan, Zhiyuan Huang, Delong Liu, Jiawei Guo, Zhicheng Zhao, Fei Su
Title: MOOM: Maintenance, Organization and Optimization of Memory in Ultra-Long Role-Playing Dialogues
Abstract:
Memory extraction is crucial for maintaining coherent ultra-long dialogues in human-robot role-playing scenarios. However, existing methods often exhibit uncontrolled memory growth. To address this, we propose MOOM, the first dual-branch memory plugin that leverages literary theory by modeling plot development and character portrayal as core storytelling elements. Specifically, one branch summarizes plot conflicts across multiple time scales, while the other extracts the user's character profile. MOOM further integrates a forgetting mechanism, inspired by the ``competition-inhibition'' memory theory, to constrain memory capacity and mitigate uncontrolled growth. Furthermore, we present ZH-4O, a Chinese ultra-long dialogue dataset specifically designed for role-playing, featuring dialogues that average 600 turns and include manually annotated memory information. Experimental results demonstrate that MOOM outperforms all state-of-the-art memory extraction methods, requiring fewer large language model invocations while maintaining a controllable memory capacity.
Chinese: 我们提出MOOM双分支记忆插件,通过建模情节发展与角色塑造并结合遗忘机制控制记忆增长,在专为超长对话设计的ZH-4O数据集上验证了其优越性能。
English: We introduce MOOM, a dual-branch memory plugin that models plot and character development with a forgetting mechanism to control memory growth, and validate its superior performance on the new ZH-4O dataset for ultra-long dialogues.

Authors:Zhantong Xue, Pingchuan Ma, Zhaoyu Wang, Shuai Wang
Title: From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation
Abstract:
Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, blockchain scalability, and secure finance. However, authoring ZK programs remains challenging: unlike mainstream programming, ZK development requires reasoning about finite field arithmetic, constraint systems, and gadgets, making it knowledge-intensive and error-prone. While large language models (LLMs) have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and gadget-level reasoning, remains unexplored. To address this gap, we propose \textsc{ZK-Eval}, a domain-specific evaluation pipeline that probes LLM capabilities at three levels: language knowledge, gadget competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that models excel at surface-level syntax but struggle with gadget usage and semantic correctness, often yielding incorrect programs. Based on these insights, we introduce \textsc{ZK-Coder}, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments on Circom and Noir show substantial gains, with success rates improving from 17.35\% to 83.38\% and from 32.21\% to 90.05\%, respectively. With \textsc{ZK-Eval} and \textsc{ZK-Coder}, we establish a foundation for systematically measuring and augmenting LLMs in ZK code generation to lower barriers for practitioners and advance trustworthy computation.
中文: 本研究提出ZK-Eval评估大语言模型在零知识编程中的不足,并开发ZK-Coder代理框架,显著提升了Circom和Noir语言的代码生成成功率。
English: The study introduces ZK-Eval to assess LLMs' limitations in zero-knowledge programming and proposes ZK-Coder, an agentic framework that significantly enhances code generation success rates in Circom and Noir.

Authors:Canhui Tang, Sanping Zhou, Haoyue Shi, Le Wang
Title: Action Hints: Semantic Typicality and Context Uniqueness for Generalizable Skeleton-based Video Anomaly Detection
Abstract:
Zero-Shot Video Anomaly Detection (ZS-VAD) requires temporally localizing anomalies without target domain training data, which is a crucial task due to various practical concerns, e.g., data privacy or new surveillance deployments. Skeleton-based approach has inherent generalizable advantages in achieving ZS-VAD as it eliminates domain disparities both in background and human appearance. However, existing methods only learn low-level skeleton representation and rely on the domain-limited normality boundary, which cannot generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel zero-shot video anomaly detection framework, unlocking the potential of skeleton data via action typicality and uniqueness learning. Firstly, we introduce a language-guided semantic typicality modeling module that projects skeleton snippets into action semantic space and distills LLM's knowledge of typical normal and abnormal behaviors during training. Secondly, we propose a test-time context uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and then derive scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results against skeleton-based methods on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime, featuring over 100 unseen surveillance scenes.
中文摘要:本文提出了一种新颖的零样本视频异常检测框架,通过动作典型性和独特性学习挖掘骨骼数据潜力,在无需目标领域训练样本的情况下,于多个数据集上实现了最先进的检测性能。
English Summary: This paper introduces a novel zero-shot video anomaly detection framework that leverages action typicality and uniqueness learning from skeleton data, achieving state-of-the-art performance across multiple datasets without requiring target domain training.

Authors:Haozhen Yan, Yan Hong, Suning Lang, Jiahui Zhan, Yikun Ji, Yujie Gao, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang
Title: GAMMA: Generalizable Alignment via Multi-task and Manipulation-Augmented Training for AI-Generated Image Detection
Abstract:
With generative models becoming increasingly sophisticated and diverse, detecting AI-generated images has become increasingly challenging. While existing AI-genereted Image detectors achieve promising performance on in-distribution generated images, their generalization to unseen generative models remains limited. This limitation is largely attributed to their reliance on generation-specific artifacts, such as stylistic priors and compression patterns. To address these limitations, we propose GAMMA, a novel training framework designed to reduce domain bias and enhance semantic alignment. GAMMA introduces diverse manipulation strategies, such as inpainting-based manipulation and semantics-preserving perturbations, to ensure consistency between manipulated and authentic content. We employ multi-task supervision with dual segmentation heads and a classification head, enabling pixel-level source attribution across diverse generative domains. In addition, a reverse cross-attention mechanism is introduced to allow the segmentation heads to guide and correct biased representations in the classification branch. Our method achieves state-of-the-art generalization performance on the GenImage benchmark, imporving accuracy by 5.8%, but also maintains strong robustness on newly released generative model such as GPT-4o.
中文摘要:提出的GAMMA框架通过多样化处理策略和多任务监督来减少领域偏差,在基准测试中实现了最优泛化性能,并对GPT-4o等新型生成模型保持强鲁棒性。
English Summary: The proposed GAMMA framework enhances AI-generated image detection by reducing domain bias through diverse manipulation strategies and multi-task supervision, achieving state-of-the-art generalization on benchmarks and robustness against new models like GPT-4o.

Authors:Zhengyu Hu, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Jianxun Lian, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
Title: Population-Aligned Persona Generation for LLM-based Social Simulation
Abstract:
Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.
中文: 本文提出了一种系统框架,利用大语言模型从社交媒体数据生成高质量、与人口分布对齐的角色集,通过重要性采样和心理测量校准显著减少偏差,提升社会模拟的准确性和灵活性。
English: This paper introduces a systematic framework for creating high-quality, population-aligned persona sets from social media data using LLMs, which reduces bias and enhances the accuracy of social simulations across diverse applications.

Authors:Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
Title: Population-Aligned Persona Generation for LLM-based Social Simulation
Abstract:
Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.
中文: 本文提出了一种系统框架,利用大语言模型从社交媒体数据生成高质量、与人口分布对齐的角色集,通过重要性采样和心理测量校准显著减少偏差,提升社会模拟的准确性和灵活性。
English: This paper introduces a systematic framework for creating high-quality, population-aligned persona sets from social media data using LLMs, which reduces bias and enhances the accuracy of social simulations across diverse applications.

Authors:Anqi Chen, Riccardo Preatoni, Alessandro Brighente, Mauro Conti, Cristina Nita-Rotaru
Title: Cross-Service Token: Finding Attacks in 5G Core Networks
Abstract:
5G marks a major departure from previous cellular architectures, by transitioning from a monolithic design of the core network to a Service-Based Architecture (SBA) where services are modularized as Network Functions (NFs) which communicate with each other via standard-defined HTTP-based APIs called Service-Based Interfaces (SBIs). These NFs are deployed in private and public cloud infrastructure, and an access control framework based on OAuth restricts how they communicate with each other and obtain access to resources. Given the increased vulnerabilities of clouds to insiders, it is important to study the security of the 5G Core services for vulnerabilities that allow attackers to use compromised NFs to obtain unauthorized access to resources. We present FivGeeFuzz, a grammar-based fuzzing framework designed to uncover security flaws in 5G core SBIs. FivGeeFuzz automatically derives grammars from 3GPP API specifications to generate malformed, unexpected, or semantically inconsistent inputs, and it integrates automated bug detection with manual validation and root-cause analysis. We evaluate our approach on free5GC, the only open-source 5G core implementing Release 17-compliant SBIs with an access control mechanism. Using FivGeeFuzz, we discovered 8 previously unknown vulnerabilities in free5GC, leading to runtime crashes, improper error handling, and unauthorized access to resources, including a very severe attack we call Cross-Service Token Attack. All bugs were confirmed by the free5GC team, 7 have already been patched, and the remaining one has a patch under development.
中文: 5G采用基于服务的架构,通过模块化网络功能进行通信,而FivGeeFuzz模糊测试框架在free5GC中发现了8个漏洞,包括未授权访问等安全风险。
English: 5G introduces a Service-Based Architecture with modular Network Functions communicating via HTTP-based APIs, and FivGeeFuzz is a fuzzing framework that discovered 8 vulnerabilities in free5GC, including unauthorized access risks.

Authors:Rochana Prih Hastuti, Rian Adam Rajagede, Mansour Al Ghanim, Mengxin Zheng, Qian Lou
Title: Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts
Abstract:
As large language models (LLMs) are adapted to sensitive domains such as medicine, their fluency raises safety risks, particularly regarding provenance and accountability. Watermarking embeds detectable patterns to mitigate these risks, yet its reliability in medical contexts remains untested. Existing benchmarks focus on detection-quality tradeoffs and overlook factual risks. In medical text, watermarking often reweights low-entropy tokens, which are highly predictable and often carry critical medical terminology. Shifting these tokens can cause inaccuracy and hallucinations, risks that prior general-domain benchmarks fail to capture. We propose a medical-focused evaluation workflow that jointly assesses factual accuracy and coherence. Using GPT-Judger and further human validation, we introduce the Factuality-Weighted Score (FWS), a composite metric prioritizing factual accuracy beyond coherence to guide watermarking deployment in medical domains. Our evaluation shows current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation. These findings underscore the need for domain-aware watermarking approaches that preserve the integrity of medical content.
中文摘要:当前大型语言模型的水印技术虽能降低风险,却因改变关键医学术语而严重损害事实准确性,需采用如事实性加权评分等专业评估方法,以保障医疗内容的精确性与完整性。
English Summary: Current watermarking techniques for large language models, while mitigating some risks, significantly compromise medical factuality by altering critical terminology, necessitating domain-specific evaluations like the proposed Factuality-Weighted Score to ensure accuracy and integrity in healthcare applications.

Authors:Haiqing Ren, Zhongkai Luo, Heng Fan, Xiaohui Yuan, Guanchen Wang, Libo Zhang
Title: G3CN: Gaussian Topology Refinement Gated Graph Convolutional Network for Skeleton-Based Action Recognition
Abstract:
Graph Convolutional Networks (GCNs) have proven to be highly effective for skeleton-based action recognition, primarily due to their ability to leverage graph topology for feature aggregation, a key factor in extracting meaningful representations. However, despite their success, GCNs often struggle to effectively distinguish between ambiguous actions, revealing limitations in the representation of learned topological and spatial features. To address this challenge, we propose a novel approach, Gaussian Topology Refinement Gated Graph Convolution (G$^{3}$CN), to address the challenge of distinguishing ambiguous actions in skeleton-based action recognition. G$^{3}$CN incorporates a Gaussian filter to refine the skeleton topology graph, improving the representation of ambiguous actions. Additionally, Gated Recurrent Units (GRUs) are integrated into the GCN framework to enhance information propagation between skeleton points. Our method shows strong generalization across various GCN backbones. Extensive experiments on NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks demonstrate that G$^{3}$CN effectively improves action recognition, particularly for ambiguous samples.
中文: 提出的G³CN方法通过高斯滤波器优化图拓扑结构并集成门控循环单元,有效提升了基于骨架动作识别中模糊动作的区分能力,在多个基准测试中表现优异。
English: The proposed G³CN method enhances skeleton-based action recognition by refining graph topology with Gaussian filters and integrating GRUs to better distinguish ambiguous actions, demonstrating strong performance across multiple benchmarks.

Authors:Yue Gu, Zhihao Du, Ying Shi, Shiliang Zhang, Qian Chen, Jiqing Han
Title: Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling
Abstract:
Recently, cross-attention-based contextual automatic speech recognition (ASR) models have made notable advancements in recognizing personalized biasing phrases. However, the effectiveness of cross-attention is affected by variations in biasing information volume, especially when the length of the biasing list increases significantly. We find that, regardless of the length of the biasing list, only a limited amount of biasing information is most relevant to a specific ASR intermediate representation. Therefore, by identifying and integrating the most relevant biasing information rather than the entire biasing list, we can alleviate the effects of variations in biasing information volume for contextual ASR. To this end, we propose a purified semantic correlation joint modeling (PSC-Joint) approach. In PSC-Joint, we define and calculate three semantic correlations between the ASR intermediate representations and biasing information from coarse to fine: list-level, phrase-level, and token-level. Then, the three correlations are jointly modeled to produce their intersection, so that the most relevant biasing information across various granularities is highlighted and integrated for contextual recognition. In addition, to reduce the computational cost introduced by the joint modeling of three semantic correlations, we also propose a purification mechanism based on a grouped-and-competitive strategy to filter out irrelevant biasing phrases. Compared with baselines, our PSC-Joint approach achieves average relative F1 score improvements of up to 21.34% on AISHELL-1 and 28.46% on KeSpeech, across biasing lists of varying lengths.
中文:提出的PSC-Joint方法通过联合建模多层级语义关联,选择性整合最相关的偏置信息来增强上下文ASR性能,在不同长度的偏置列表上均实现了显著提升。
English: The proposed PSC-Joint approach enhances contextual ASR by jointly modeling multi-level semantic correlations to selectively integrate the most relevant biasing information, achieving significant performance improvements across varying biasing list lengths.

Authors:Juraj Vladika, Mahdi Dhaini, Florian Matthes
Title: Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models
Abstract:
The growing capabilities of Large Language Models (LLMs) show significant potential to enhance healthcare by assisting medical researchers and physicians. However, their reliance on static training data is a major risk when medical recommendations evolve with new research and developments. When LLMs memorize outdated medical knowledge, they can provide harmful advice or fail at clinical reasoning tasks. To investigate this problem, we introduce two novel question-answering (QA) datasets derived from systematic reviews: MedRevQA (16,501 QA pairs covering general biomedical knowledge) and MedChangeQA (a subset of 512 QA pairs where medical consensus has changed over time). Our evaluation of eight prominent LLMs on the datasets reveals consistent reliance on outdated knowledge across all models. We additionally analyze the influence of obsolete pre-training data and training strategies to explain this phenomenon and propose future directions for mitigation, laying the groundwork for developing more current and reliable medical AI systems.
中文: 本研究通过构建两个新颖数据集评估大语言模型对过时医学知识的依赖,发现八个主流模型均存在这一问题,并提出了改进方向以开发更可靠的医疗人工智能系统。
English: This study introduces two novel datasets to evaluate large language models' reliance on outdated medical knowledge, revealing consistent performance issues across eight prominent models and proposing mitigation strategies for developing more reliable medical AI systems.

Authors:Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Title: What Fundamental Structure in Reward Functions Enables Efficient Sparse-Reward Learning?
Abstract:
Sparse-reward reinforcement learning (RL) remains fundamentally hard: without structure, any agent needs $Ω(|\mathcal{S}||\mathcal{A}|/p)$ samples to recover rewards. We introduce Policy-Aware Matrix Completion (PAMC) as a first concrete step toward a structural reward learning framework. Our key idea is to exploit approximate low-rank + sparse structure in the reward matrix, under policy-biased (MNAR) sampling. We prove recovery guarantees with inverse-propensity weighting, and establish a visitation-weighted error-to-regret bound linking completion error to control performance. Importantly, when assumptions weaken, PAMC degrades gracefully: confidence intervals widen and the algorithm abstains, ensuring safe fallback to exploration. Empirically, PAMC improves sample efficiency across Atari-26 (10M steps), DM Control, MetaWorld MT50, D4RL offline RL, and preference-based RL benchmarks, outperforming DrQ-v2, DreamerV3, Agent57, T-REX/D-REX, and PrefPPO under compute-normalized comparisons. Our results highlight PAMC as a practical and principled tool when structural rewards exist, and as a concrete first instantiation of a broader structural reward learning perspective.
中文: 策略感知矩阵补全(PAMC)提出了一种结构奖励学习框架,利用策略偏置采样下的低秩稀疏奖励结构,在稀疏奖励强化学习中实现了更高的样本效率,并在多个基准测试中表现出优雅的性能退化特性。
English: Policy-Aware Matrix Completion (PAMC) introduces a structural reward learning framework that leverages low-rank and sparse reward structures under policy-biased sampling, achieving improved sample efficiency and graceful degradation in sparse-reward reinforcement learning across multiple benchmarks.

Authors:Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Title: Differentiable Entropy Regularization for Geometry and Neural Networks
Abstract:
We introduce a differentiable estimator of range-partition entropy, a recent concept from computational geometry that enables algorithms to adapt to the "sortedness" of their input. While range-partition entropy provides strong guarantees in algorithm design, it has not yet been made accessible to deep learning. In this work, we (i) propose the first differentiable approximation of range-partition entropy, enabling its use as a trainable loss or regularizer; (ii) design EntropyNet, a neural module that restructures data into low-entropy forms to accelerate downstream instance-optimal algorithms; and (iii) extend this principle beyond geometry by applying entropy regularization directly to Transformer attention. Across tasks, we demonstrate that differentiable entropy improves efficiency without degrading correctness: in geometry, our method achieves up to $4.1\times$ runtime speedups with negligible error ($<0.2%$); in deep learning, it induces structured attention patterns that yield 6% higher accuracy at 80% sparsity compared to L1 baselines. Our theoretical analysis provides approximation bounds for the estimator, and extensive ablations validate design choices. These results suggest that entropy-bounded computation is not only theoretically elegant but also a practical mechanism for adaptive learning, efficiency, and structured representation.
中文摘要:本研究提出了可微分范围划分熵估计器,将其作为可训练损失函数或正则化器融入深度学习,能在不影响准确性的前提下提升计算效率并增强结构化表征能力。
English Summary: This work introduces a differentiable estimator for range-partition entropy, enabling its integration into deep learning as a trainable loss or regularizer to improve computational efficiency and structured representations without compromising accuracy.

Authors:Sophia Bianchi Moyen, Rickmer Krohn, Sophie Lueth, Kay Pompetzki, Jan Peters, Vignesh Prasad, Georgia Chalvatzaki
Title: The Role of Embodiment in Intuitive Whole-Body Teleoperation for Mobile Manipulation
Abstract:
Intuitive Teleoperation interfaces are essential for mobile manipulation robots to ensure high quality data collection while reducing operator workload. A strong sense of embodiment combined with minimal physical and cognitive demands not only enhances the user experience during large-scale data collection, but also helps maintain data quality over extended periods. This becomes especially crucial for challenging long-horizon mobile manipulation tasks that require whole-body coordination. We compare two distinct robot control paradigms: a coupled embodiment integrating arm manipulation and base navigation functions, and a decoupled embodiment treating these systems as separate control entities. Additionally, we evaluate two visual feedback mechanisms: immersive virtual reality and conventional screen-based visualization of the robot's field of view. These configurations were systematically assessed across a complex, multi-stage task sequence requiring integrated planning and execution. Our results show that the use of VR as a feedback modality increases task completion time, cognitive workload, and perceived effort of the teleoperator. Coupling manipulation and navigation leads to a comparable workload on the user as decoupling the embodiments, while preliminary experiments suggest that data acquired by coupled teleoperation leads to better imitation learning performance. Our holistic view on intuitive teleoperation interfaces provides valuable insight into collecting high-quality, high-dimensional mobile manipulation data at scale with the human operator in mind. Project website:https://sophiamoyen.github.io/role-embodiment-wbc-moma-teleop/
中文: 直观的遥操作界面通过耦合机械臂操控与底盘导航功能可提升移动操作任务的数据质量,其中虚拟现实反馈会增加操作员负担,而耦合控制模式在模仿学习性能方面展现出优势。
English: Intuitive teleoperation interfaces that couple manipulation and navigation functions can enhance data quality for mobile manipulation tasks, with VR feedback increasing operator workload while coupled control shows promise for improving imitation learning performance.

Authors:Chao Fan, Xibin Jia, Anqi Xiao, Hongyuan Yu, Zhenghan Yang, Dawei Yang, Hui Xu, Yan Huang, Liang Wang
Title: SPENet: Self-guided Prototype Enhancement Network for Few-shot Medical Image Segmentation
Abstract:
Few-Shot Medical Image Segmentation (FSMIS) aims to segment novel classes of medical objects using only a few labeled images. Prototype-based methods have made significant progress in addressing FSMIS. However, they typically generate a single global prototype for the support image to match with the query image, overlooking intra-class variations. To address this issue, we propose a Self-guided Prototype Enhancement Network (SPENet). Specifically, we introduce a Multi-level Prototype Generation (MPG) module, which enables multi-granularity measurement between the support and query images by simultaneously generating a global prototype and an adaptive number of local prototypes. Additionally, we observe that not all local prototypes in the support image are beneficial for matching, especially when there are substantial discrepancies between the support and query images. To alleviate this issue, we propose a Query-guided Local Prototype Enhancement (QLPE) module, which adaptively refines support prototypes by incorporating guidance from the query image, thus mitigating the negative effects of such discrepancies. Extensive experiments on three public medical datasets demonstrate that SPENet outperforms existing state-of-the-art methods, achieving superior performance.
中文摘要:自引导原型增强网络(SPENet)通过生成多级原型并结合查询图像进行局部优化,有效解决了少样本医学图像分割中的类内差异问题,在三个公共医学数据集上实现了最优性能。
English Summary: The Self-guided Prototype Enhancement Network (SPENet) addresses limitations in few-shot medical image segmentation by generating multi-level prototypes and refining them with query guidance, achieving state-of-the-art results across three medical datasets.

Authors:Anum Afzal, Juraj Vladika, Florian Matthes
Title: FActBench: A Benchmark for Fine-grained Automatic Evaluation of LLM-Generated Text in the Medical Domain
Abstract:
Large Language Models tend to struggle when dealing with specialized domains. While all aspects of evaluation hold importance, factuality is the most critical one. Similarly, reliable fact-checking tools and data sources are essential for hallucination mitigation. We address these issues by providing a comprehensive Fact-checking Benchmark FActBench covering four generation tasks and six state-of-the-art Large Language Models (LLMs) for the Medical domain. We use two state-of-the-art Fact-checking techniques: Chain-of-Thought (CoT) Prompting and Natural Language Inference (NLI). Our experiments show that the fact-checking scores acquired through the Unanimous Voting of both techniques correlate best with Domain Expert Evaluation.
中文摘要:大语言模型在专业领域面临挑战,尤其在事实准确性方面,FActBench通过评估六个模型的四项生成任务,采用思维链和自然语言推理技术,表明两种方法的一致投票结果与专家评估最为吻合。
English Summary: Large Language Models face challenges in specialized domains, particularly with factuality, which FActBench addresses by evaluating four generation tasks across six models using Chain-of-Thought and Natural Language Inference techniques, showing that unanimous voting correlates best with expert assessments.

Authors:Evan Chen, Seyyedali Hosseinalipour, Christopher G. Brinton, David J. Love
Title: Federated Foundation Models in Harsh Wireless Environments: Prospects, Challenges, and Future Directions
Abstract:
Foundation models (FMs) have shown remarkable capabilities in generalized intelligence, multimodal understanding, and adaptive learning across a wide range of domains. However, their deployment in harsh or austere environments -- characterized by intermittent connectivity, limited computation, noisy data, and dynamically changing network topologies -- remains an open challenge. Existing distributed learning methods such as federated learning (FL) struggle to adapt in such settings due to their reliance on stable infrastructure, synchronized updates, and resource-intensive training. In this work, we explore the potential of Federated Foundation Models (FFMs) as a promising paradigm to address these limitations. By integrating the scalability and generalization power of FMs with novel decentralized, communication-aware FL frameworks, we aim to enable robust, energy-efficient, and adaptive intelligence in extreme and adversarial conditions. We present a detailed breakdown of system-level constraints in harsh environments, and discuss the open research challenges in communication design, model robustness, and energy-efficient personalization for these unique settings.
Chinese: 联邦基础模型(FFMs)将基础模型的可扩展性和泛化能力与去中心化、通信感知的联邦学习相结合,旨在在连接有限、资源匮乏的恶劣环境中实现鲁棒且自适应的智能系统。
English: Federated Foundation Models (FFMs) integrate the scalability and generalization of foundation models with decentralized, communication-aware federated learning to enable robust and adaptive intelligence in harsh environments with limited connectivity and resources.

Authors:Shanshan Wang, Junchao Wu, Fengying Ye, Jingming Yao, Lidia S. Chao, Derek F. Wong
Title: Benchmarking the Detection of LLMs-Generated Modern Chinese Poetry
Abstract:
The rapid development of advanced large language models (LLMs) has made AI-generated text indistinguishable from human-written text. Previous work on detecting AI-generated text has made effective progress, but has not involved modern Chinese poetry. Due to the distinctive characteristics of modern Chinese poetry, it is difficult to identify whether a poem originated from humans or AI. The proliferation of AI-generated modern Chinese poetry has significantly disrupted the poetry ecosystem. Based on the urgency of identifying AI-generated poetry in the real Chinese world, this paper proposes a novel benchmark for detecting LLMs-generated modern Chinese poetry. We first construct a high-quality dataset, which includes both 800 poems written by six professional poets and 41,600 poems generated by four mainstream LLMs. Subsequently, we conduct systematic performance assessments of six detectors on this dataset. Experimental results demonstrate that current detectors cannot be used as reliable tools to detect modern Chinese poems generated by LLMs. The most difficult poetic features to detect are intrinsic qualities, especially style. The detection results verify the effectiveness and necessity of our proposed benchmark. Our work lays a foundation for future detection of AI-generated poetry.
中文:现有AI文本检测器难以识别具有独特风格特征的现代汉语诗歌,本文提出的新基准验证了当前工具无法可靠区分AI生成与人类创作的诗歌,为未来检测奠定了基础。
English: Current AI-generated text detectors are ineffective for modern Chinese poetry due to its unique stylistic features, necessitating a new benchmark that reveals existing tools' unreliability in distinguishing AI-composed poems from human works.

Authors:Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Title: Inducing Faithfulness in Structured Reasoning via Counterfactual Sensitivity
Abstract:
The reasoning processes of large language models often lack faithfulness; a model may generate a correct answer while relying on a flawed or irrelevant reasoning trace. This behavior, a direct consequence of training objectives that solely reward final-answer correctness, severely undermines the trustworthiness of these models in high-stakes domains. This paper introduces \textbf{Counterfactual Sensitivity Regularization (CSR)}, a novel training objective designed to forge a strong, causal-like dependence between a model's output and its intermediate reasoning steps. During training, CSR performs automated, operator-level interventions on the generated reasoning trace (e.g., swapping ``+'' with ``-'') to create a minimally-perturbed counterfactual. A regularization term then penalizes the model if this logically flawed trace still yields the original answer. Our efficient implementation adds only 8.7\% training overhead through warm-start curriculum and token-subset optimization. We evaluate faithfulness using \textbf{Counterfactual Outcome Sensitivity (COS)}, a metric quantifying how sensitive the final answer is to such logical perturbations. Across diverse structured reasoning benchmarks -- arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop QA (HotpotQA), and code generation (MBPP) -- models trained with CSR demonstrate a vastly superior trade-off between accuracy and faithfulness. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, with this learned sensitivity generalizing to larger models and enhancing the performance of inference-time techniques like self-consistency.
中文: 本文提出反事实敏感性正则化(CSR)方法,通过惩罚模型在逻辑错误的推理过程中仍得出正确答案的行为,显著增强了大语言模型推理过程的忠实性,从而提升了模型在多种推理任务中的可信度。
English: This paper introduces Counterfactual Sensitivity Regularization (CSR), a training objective that enhances the faithfulness of large language models by penalizing them when logically flawed reasoning traces still produce correct answers, thereby improving the trustworthiness of their outputs across various reasoning tasks.

Authors:Wen-Chin Huang, Hui Wang, Cheng Liu, Yi-Chiao Wu, Andros Tjandra, Wei-Ning Hsu, Erica Cooper, Yong Qin, Tomoki Toda
Title: The AudioMOS Challenge 2025
Abstract:
This is the summary paper for the AudioMOS Challenge 2025, the very first challenge for automatic subjective quality prediction for synthetic audio. The challenge consists of three tracks. The first track aims to assess text-to-music samples in terms of overall quality and textual alignment. The second track is based on the four evaluation dimensions of Meta Audiobox Aesthetics, and the test set consists of text-to-speech, text-to-audio, and text-to-music samples. The third track focuses on synthetic speech quality assessment in different sampling rates. The challenge attracted 24 unique teams from both academia and industry, and improvements over the baselines were confirmed. The outcome of this challenge is expected to facilitate development and progress in the field of automatic evaluation for audio generation systems.
中文: AudioMOS 2025挑战赛是首个针对合成音频自动主观质量评估的竞赛,设有三个赛道分别评估文本生成音乐、多维度音频美学和不同采样率的语音质量,吸引了24支团队参与并验证了超越基线的改进,将推动音频生成系统评估领域的发展。
English: The AudioMOS Challenge 2025 is the inaugural competition for automatic subjective quality assessment of synthetic audio, featuring three tracks that evaluate text-to-music, multi-dimensional audio aesthetics, and speech quality across sampling rates, with participation from 24 teams demonstrating improvements over baselines to advance audio generation evaluation.

Authors:Runjia Zeng, Guangyan Sun, Qifan Wang, Tong Geng, Sohail Dianat, Xiaotian Han, Raghuveer Rao, Xueling Zhang, Cheng Han, Lifu Huang, Dongfang Liu
Title: MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper
Abstract:
Considering deep neural networks as manifold mappers, the pretrain-then-fine-tune paradigm can be interpreted as a two-stage process: pretrain establishes a broad knowledge base, and fine-tune adjusts the model parameters to activate specific neural pathways to align with the target manifold. Although prior fine-tuning approaches demonstrate success, their rigid parameter space limits their ability to dynamically activate appropriate neural pathways, rendering them ill-equipped to adapt flexibly to the diverse and evolving data distributions. In light of this view, we propose a novel approach, Mixture of Expert Prompt Tuning (MEPT), as an effective and efficient manifold-mapping framework. MEPT leverages the Mixture of Experts architecture by integrating multiple prompt experts to adaptively learn diverse and non-stationary data distributions. Empirical evaluations demonstrate that MEPT outperforms several state-of-the-art parameter efficient baselines on SuperGLUE, achieving notable improvements in mean accuracy (e.g., 1.94%) while significantly reducing activated prompts by 79.25%. The effectiveness of MEPT is further supported by theoretical insights from manifold learning and validated through neural activation pathway visualization results. Our code is avaliable at https://runjia.tech/emnlp_mept/.
中文:提出的专家提示调优混合框架通过多个提示专家动态激活神经通路以适应多样化数据分布,在基准测试中实现了更优的准确率和效率。
English: The proposed Mixture of Expert Prompt Tuning (MEPT) framework dynamically activates neural pathways through multiple prompt experts to adapt to diverse data distributions, achieving superior accuracy and efficiency on benchmark tests.

Authors:Jiacheng Jiang, Yuan Meng, Chen Tang, Han Yu, Qun Li, Zhi Wang, Wenwu Zhu
Title: Quantization Meets OOD: Generalizable Quantization-aware Training from a Flatness Perspective
Abstract:
Current quantization-aware training (QAT) methods primarily focus on enhancing the performance of quantized models on in-distribution (I.D) data, while overlooking the potential performance degradation on out-of-distribution (OOD) data. In this paper, we first substantiate this problem through rigorous experiment, showing that QAT can lead to a significant OOD generalization performance degradation. Further, we find the contradiction between the perspective that flatness of loss landscape gives rise to superior OOD generalization and the phenomenon that QAT lead to a sharp loss landscape, can cause the above problem. Therefore, we propose a flatness-oriented QAT method, FQAT, to achieve generalizable QAT. Specifically, i) FQAT introduces a layer-wise freezing mechanism to mitigate the gradient conflict issue between dual optimization objectives (i.e., vanilla QAT and flatness). ii) FQAT proposes an disorder-guided adaptive freezing algorithm to dynamically determines which layers to freeze at each training step, effectively addressing the challenges caused by interference between layers. A gradient disorder metric is designed to help the algorithm identify unstable layers during training. Extensive experiments on influential OOD benchmark demonstrate the superiority of our method over state-of-the-art baselines under both I.D and OOD image classification tasks.
中文: 现有量化感知训练方法忽视了分布外数据的性能下降,因此本文提出面向平坦化的FQAT方法,通过分层冻结机制提升模型在分布内外数据上的泛化能力。
English: Current quantization-aware training methods overlook out-of-distribution performance degradation, so this paper proposes a flatness-oriented approach called FQAT with layer-wise freezing to enhance generalization across both in-distribution and out-of-distribution data.

Authors:Jialong Zuo, Guangyan Zhang, Minghui Fang, Shengpeng Ji, Xiaoqi Jiao, Jingyu Li, Yiwen Guo, Zhou Zhao
Title: Entropy-based Coarse and Compressed Semantic Speech Representation Learning
Abstract:
Discrete speech representation learning has recently attracted increasing interest in both acoustic and semantic modeling. Existing approaches typically encode 16 kHz waveforms into discrete tokens at a rate of 25 or 50 tokens per second. However, given that speech generally conveys only 2 to 5 words per second, such fine-grained tokenization introduces redundancy and hinders efficiency in downstream training and inference. Moreover, semantic speech representations at this frequency primarily capture phonetic-level information, while semantic understanding may not require such detailed token-level resolution. To address these limitations, we propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. A speech language model is first pre-trained via next-token prediction on large-scale unlabeled data to capture frequent token patterns. Predictive entropy is then used to adaptively determine aggregation boundaries, followed by a cross-attention module that fuses information within each segment. By adjusting the entropy threshold, the granularity and compression ratio of the representations can be flexibly controlled. Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences, demonstrating the effectiveness of the proposed approach.
中文摘要:本文提出一种基于熵的动态聚合框架,通过根据预测熵自适应合并语音标记来压缩语义表征,在多项语音任务中实现与密集标记序列相当或更优的性能。
English Summary: This paper introduces an entropy-based dynamic aggregation framework that compresses semantic speech representations by adaptively merging tokens based on predictive entropy, achieving comparable or superior performance to dense token sequences in various speech tasks.

Authors:Juraj Vladika, Florian Matthes
Title: MedSEBA: Synthesizing Evidence-Based Answers Grounded in Evolving Medical Literature
Abstract:
In the digital age, people often turn to the Internet in search of medical advice and recommendations. With the increasing volume of online content, it has become difficult to distinguish reliable sources from misleading information. Similarly, millions of medical studies are published every year, making it challenging for researchers to keep track of the latest scientific findings. These evolving studies can reach differing conclusions, which is not reflected in traditional search tools. To address these challenges, we introduce MedSEBA, an interactive AI-powered system for synthesizing evidence-based answers to medical questions. It utilizes the power of Large Language Models to generate coherent and expressive answers, but grounds them in trustworthy medical studies dynamically retrieved from the research database PubMed. The answers consist of key points and arguments, which can be traced back to respective studies. Notably, the platform also provides an overview of the extent to which the most relevant studies support or refute the given medical claim, and a visualization of how the research consensus evolved through time. Our user study revealed that medical experts and lay users find the system usable and helpful, and the provided answers trustworthy and informative. This makes the system well-suited for both everyday health questions and advanced research insights.
中文:MedSEBA是一个基于人工智能的系统,通过将大型语言模型的输出与动态检索自PubMed的医学研究相结合,生成基于证据的医疗答案,并为专家和普通用户提供可追溯的论点及研究共识可视化。
English: MedSEBA is an AI-powered system that synthesizes evidence-based medical answers by grounding Large Language Model outputs in dynamically retrieved PubMed studies, providing traceable arguments and research consensus visualizations for both experts and general users.

Authors:Numan Saeed, Salma Hassan, Shahad Hardan, Ahmed Aly, Darya Taratynova, Umair Nawaz, Ufaq Khan, Muhammad Ridzuan, Vincent Andrearczyk, Adrien Depeursinge, Yutong Xie, Thomas Eugene, Raphaël Metz, Mélanie Dore, Gregory Delpon, Vijay Ram Kumar Papineni, Kareem Wahid, Cem Dede, Alaa Mohamed Shawky Ali, Carlos Sjogreen, Mohamed Naser, Clifton D. Fuller, Valentin Oreiller, Mario Jreige, John O. Prior, Catherine Cheze Le Rest, Olena Tankyevych, Pierre Decazes, Su Ruan, Stephanie Tanadini-Lang, Martin Vallières, Hesham Elhalawani, Ronan Abgral, Romain Floch, Kevin Kerleguer, Ulrike Schick, Maelle Mauguen, David Bourhis, Jean-Christophe Leclere, Amandine Sambourg, Arman Rahmim, Mathieu Hatt, Mohammad Yaqub
Title: A Multimodal and Multi-centric Head and Neck Cancer Dataset for Segmentation, Diagnosis and Outcome Prediction
Abstract:
We present a publicly available multimodal dataset for head and neck cancer research, comprising 1123 annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies from patients with histologically confirmed disease, acquired from 10 international medical centers. All studies contain co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity from a long-term, multi-institution retrospective collection. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following established guidelines. We provide anonymized NifTi files, expert-annotated segmentation masks, comprehensive clinical metadata, and radiotherapy dose distributions for a patient subset. The metadata include TNM staging, HPV status, demographics, long-term follow-up outcomes, survival times, censoring indicators, and treatment information. To demonstrate its utility, we benchmark three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, using state-of-the-art deep learning models like UNet, SegResNet, and multimodal prognostic frameworks.
中文: 我们发布了一个公开可用的头颈癌研究多模态数据集,包含1,123例带标注的PET/CT研究数据,配有专家分割结果和完整临床元数据,并通过使用先进深度学习模型在肿瘤分割、生存预测和HPV分类任务上的基准测试验证了其实用价值。
English: We introduce a publicly available multimodal dataset for head and neck cancer research, featuring 1,123 annotated PET/CT studies with expert segmentations and comprehensive clinical metadata, and demonstrate its utility through benchmarking on tumor segmentation, survival prediction, and HPV classification tasks using advanced deep learning models.

Authors:Yuqi Li, Chuanguang Yang, Junhao Dong, Zhengtao Yao, Haoyan Xu, Zeyu Dong, Hansheng Zeng, Zhulin An, Yingli Tian
Title: AMMKD: Adaptive Multimodal Multi-teacher Distillation for Lightweight Vision-Language Models
Abstract:
The success of large-scale visual language pretraining (VLP) models has driven widespread adoption of image-text retrieval tasks. However, their deployment on mobile devices remains limited due to large model sizes and computational complexity. We propose Adaptive Multi-Modal Multi-Teacher Knowledge Distillation (AMMKD), a novel framework that integrates multi-modal feature fusion, multi-teacher distillation, and adaptive optimization to deliver lightweight yet effective retrieval models. Specifically, our method begins with a feature fusion network that extracts and merges discriminative features from both the image and text modalities. To reduce model parameters and further improve performance, we design a multi-teacher knowledge distillation framework to pre-train two CLIP teacher models. We decouple modalities by pre-computing and storing text features as class vectors via the teacher text encoder to enhance efficiency. To better align teacher and student outputs, we apply KL scatter for probability distribution matching. Finally, we design an adaptive dynamic weighting scheme that treats multi-teacher distillation as a multi-objective optimization problem. By leveraging gradient space diversity, we dynamically adjust the influence of each teacher, reducing conflicts and guiding the student toward more optimal learning directions. Extensive experiments on three benchmark datasets demonstrate that AMMKD achieves superior performance while significantly reducing model complexity, validating its effectiveness and flexibility.
中文:AMMKD框架通过多模态融合、多教师蒸馏和自适应优化,开发出高效轻量的图文检索模型,在基准测试中以更低复杂度实现了优越性能。
English: The AMMKD framework introduces multi-modal fusion, multi-teacher distillation, and adaptive optimization to create efficient, lightweight models for image-text retrieval, achieving high performance with reduced complexity on benchmark tests.

Authors:Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, Yezhou Yang, Varun Jampani
Title: Stable Cinemetrics : Structured Taxonomy and Evaluation for Professional Video Generation
Abstract:
Recent advances in video generation have enabled high-fidelity video synthesis from user provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analysis, both coarse and fine-grained reveal that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines. SCINE is the first approach to situate professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic controls and supporting them with structured evaluation pipelines and detailed analyses to guide future research.
中文:Stable Cinemetrics 提出了一个结构化评估框架,包含四个分层分类法来评估专业视频生成,通过大规模人工研究和自动化评估器揭示了当前模型存在显著不足。
English: Stable Cinemetrics introduces a structured evaluation framework with four hierarchical taxonomies to assess professional video generation, revealing significant gaps in current models through large-scale human studies and an automated evaluator.

Authors:Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, Jinsong Su
Title: Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization
Abstract:
Mixture-of-Experts (MoE) has emerged as a promising paradigm for efficiently scaling large language models without a proportional increase in computational cost. However, the standard training strategy of Top-K router prevents MoE models from realizing their full potential for elastic inference. When the number of activated experts is altered at inference time, these models exhibit precipitous performance degradation. In this work, we introduce Matryoshka MoE (M-MoE), a training framework that instills a coarse-to-fine structure directly into the expert ensemble. By systematically varying the number of activated experts during training, M-MoE compels the model to learn a meaningful ranking: top-ranked experts collaborate to provide essential, coarse-grained capabilities, while subsequent experts add progressively finer-grained detail. We explore this principle at multiple granularities, identifying a layer-wise randomization strategy as the most effective. Our experiments demonstrate that a single M-MoE model achieves remarkable elasticity, with its performance at various expert counts closely matching that of an entire suite of specialist models, but at only a fraction of the total training cost. This flexibility not only unlocks elastic inference but also enables optimizing performance by allocating different computational budgets to different model layers. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
Chinese: Matryoshka MoE (M-MoE) 框架通过粗到细的训练方法,使单一模型能够实现弹性推理,在不同计算预算下达到多个专业模型的性能,同时大幅降低训练成本。
English: The Matryoshka MoE (M-MoE) framework introduces a coarse-to-fine training approach that enables a single model to achieve elastic inference, matching the performance of multiple specialist models at various computational budgets while significantly reducing training costs.

Authors:Zhenyue Qin, Yang Liu, Yu Yin, Jinyu Ding, Haoran Zhang, Anran Li, Dylan Campbell, Xuansheng Wu, Ke Zou, Tiarnan D. L. Keenan, Emily Y. Chew, Zhiyong Lu, Yih-Chung Tham, Ninghao Liu, Xiuzhen Zhang, Qingyu Chen
Title: LMOD+: A Comprehensive Multimodal Dataset and Benchmark for Developing and Evaluating Multimodal Large Language Models in Ophthalmology
Abstract:
Vision-threatening eye diseases pose a major global health burden, with timely diagnosis limited by workforce shortages and restricted access to specialized care. While multimodal large language models (MLLMs) show promise for medical image interpretation, advancing MLLMs for ophthalmology is hindered by the lack of comprehensive benchmark datasets suitable for evaluating generative models. We present a large-scale multimodal ophthalmology benchmark comprising 32,633 instances with multi-granular annotations across 12 common ophthalmic conditions and 5 imaging modalities. The dataset integrates imaging, anatomical structures, demographics, and free-text annotations, supporting anatomical structure recognition, disease screening, disease staging, and demographic prediction for bias evaluation. This work extends our preliminary LMOD benchmark with three major enhancements: (1) nearly 50% dataset expansion with substantial enlargement of color fundus photography; (2) broadened task coverage including binary disease diagnosis, multi-class diagnosis, severity classification with international grading standards, and demographic prediction; and (3) systematic evaluation of 24 state-of-the-art MLLMs. Our evaluations reveal both promise and limitations. Top-performing models achieved ~58% accuracy in disease screening under zero-shot settings, and performance remained suboptimal for challenging tasks like disease staging. We will publicly release the dataset, curation pipeline, and leaderboard to potentially advance ophthalmic AI applications and reduce the global burden of vision-threatening diseases.
Chinese: 本研究提出了一个大规模多模态眼科基准数据集,旨在解决眼科生成模型评估中缺乏全面数据的问题,并通过系统评估揭示了当前多模态大语言模型在疾病诊断中的潜力与局限。
English: This work introduces a large-scale multimodal ophthalmology benchmark to address the lack of comprehensive datasets for evaluating generative models in diagnosing vision-threatening diseases, demonstrating both the potential and limitations of current MLLMs through systematic evaluation.

Authors:Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, Yangyang Shi
Title: DepthLM: Metric Depth From Vision Language Models
Abstract:
Vision language models (VLMs) can flexibly address various vision tasks through text interactions. Although successful in semantic understanding, state-of-the-art VLMs including GPT-5 still struggle in understanding 3D from 2D inputs. On the other hand, expert pure vision models achieve super-human accuracy in metric depth estimation, a key 3D understanding task. However, they require task-specific architectures and losses. Such difference motivates us to ask: Can VLMs reach expert-level accuracy without architecture or loss change? We take per-pixel metric depth estimation as the representative task and show that the answer is yes! Surprisingly, comprehensive analysis shows that text-based supervised-finetuning with sparse labels is sufficient for VLMs to unlock strong 3D understanding, no dense prediction head or complex regression/regularization loss is needed. The bottleneck for VLMs lies actually in pixel reference and cross-dataset camera ambiguity, which we address through visual prompting and intrinsic-conditioned augmentation. With much smaller models, our method DepthLM surpasses the accuracy of most advanced VLMs by over 2x, making VLMs for the first time comparable with pure vision models. Interestingly, without explicit enforcement during training, VLMs trained with DepthLM naturally avoids over-smoothing, having much fewer flying points at boundary regions than pure vision models. The simplicity of DepthLM also enables a single VLM to cover various 3D tasks beyond metric depth. Our code and model will be released at the link below.
中文: 视觉语言模型通过基于文本的稀疏标签微调即可在三维理解任务中达到专家级精度,无需复杂架构调整便能超越先进视觉语言模型并与纯视觉模型相媲美。
English: Vision language models can achieve expert-level accuracy in 3D understanding tasks like metric depth estimation through simple text-based fine-tuning with sparse labels, surpassing advanced VLMs and matching pure vision models without complex architectural changes.

Authors:Haozhe Jia, Wenshuo Chen, Yuqi Lin, Yang Yang, Lei Wang, Mang Ning, Bowen Tian, Songning Lai, Nanqian Jia, Yifan Chen, Yutao Yue
Title: LUMA: Low-Dimension Unified Motion Alignment with Dual-Path Anchoring for Text-to-Motion Diffusion Model
Abstract:
While current diffusion-based models, typically built on U-Net architectures, have shown promising results on the text-to-motion generation task, they still suffer from semantic misalignment and kinematic artifacts. Through analysis, we identify severe gradient attenuation in the deep layers of the network as a key bottleneck, leading to insufficient learning of high-level features. To address this issue, we propose \textbf{LUMA} (\textit{\textbf{L}ow-dimension \textbf{U}nified \textbf{M}otion \textbf{A}lignment}), a text-to-motion diffusion model that incorporates dual-path anchoring to enhance semantic alignment. The first path incorporates a lightweight MoCLIP model trained via contrastive learning without relying on external data, offering semantic supervision in the temporal domain. The second path introduces complementary alignment signals in the frequency domain, extracted from low-frequency DCT components known for their rich semantic content. These two anchors are adaptively fused through a temporal modulation mechanism, allowing the model to progressively transition from coarse alignment to fine-grained semantic refinement throughout the denoising process. Experimental results on HumanML3D and KIT-ML demonstrate that LUMA achieves state-of-the-art performance, with FID scores of 0.035 and 0.123, respectively. Furthermore, LUMA accelerates convergence by 1.4$\times$ compared to the baseline, making it an efficient and scalable solution for high-fidelity text-to-motion generation.
中文:LUMA模型通过双路径锚定和时间调制机制解决文本驱动动作生成的语义错位问题,在实现最优性能的同时显著加速了收敛过程。
English: The proposed LUMA model addresses semantic misalignment and kinematic artifacts in text-to-motion generation through dual-path anchoring and temporal modulation, achieving state-of-the-art performance with accelerated convergence.

Authors:Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen, Chen Lai, Sheng Cao, Yuandong Tian, Raghuraman Krishnamoorthi, Yangyang Shi, Vikas Chandra
Title: MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Abstract:
The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pre-training with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by a established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens compared to Qwen3's proprietary 36T-token corpus for pretraining, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have released the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.
中文: 本研究通过精心筛选约2T高质量数据,证明了无需大规模语料即可在小参数模型中激发推理能力,所开发的MobileLLM-R1系列模型在多项基准测试中表现优于或媲美基于更庞大数据的模型,颠覆了推理能力必须依赖海量数据的传统认知。
English: This study challenges the assumption that large-scale data is essential for reasoning capabilities in language models by demonstrating that high-quality, curated datasets of only ~2T tokens can produce sub-billion-parameter models like MobileLLM-R1, which outperform or match larger models trained on significantly more data.

Authors:Xiaohe Bo, Rui Li, Zexu Sun, Quanyu Dai, Zeyu Zhang, Zihang Tian, Xu Chen, Zhenhua Dong
Title: Prompt and Parameter Co-Optimization for Large Language Models
Abstract:
Prompt optimization and fine-tuning are two major approaches to improve the performance of Large Language Models (LLMs). They enhance the capabilities of LLMs from complementary perspectives: the former through explicit natural language, and the latter through implicit parameter updates. However, prior work has typically studied them in isolation, leaving their synergistic potential largely underexplored. To bridge this gap, in this paper, we introduce MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning for LLM training. Specifically, we introduce two neural networks to generate prompts and parameters, respectively, while allowing them to share a common bottom encoding layer to enable knowledge sharing. By the guidance of the final supervised signals, our framework is optimized to discover the optimal combinations between the prompts and parameters. Given that prompt learning involves discrete optimization while fine-tuning operates in a continuous parameter space, we design a supervised regularization loss to train our framework effectively. Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.
Chinese: 本文提出MetaTuner框架,通过共享编码层和监督正则化损失,将提示优化与微调相结合,使大语言模型在离散和连续优化空间中协同提升性能,实验证明其优于现有基准方法。
English: This paper introduces MetaTuner, a novel framework that jointly integrates prompt optimization and fine-tuning to enhance Large Language Models by enabling knowledge sharing and discovering optimal prompt-parameter combinations through supervised regularization.

Authors:Nada Bouchekout, Abdelkrim Boukabou, Morad Grimes, Yassine Habchi, Yassine Himeur, Hamzah Ali Alkhazaleh, Shadi Atalla, Wathiq Mansoor
Title: A Novel Hybrid Deep Learning and Chaotic Dynamics Approach for Thyroid Cancer Classification
Abstract:
Timely and accurate diagnosis is crucial in addressing the global rise in thyroid cancer, ensuring effective treatment strategies and improved patient outcomes. We present an intelligent classification method that couples an Adaptive Convolutional Neural Network (CNN) with Cohen-Daubechies-Feauveau (CDF9/7) wavelets whose detail coefficients are modulated by an n-scroll chaotic system to enrich discriminative features. We evaluate on the public DDTI thyroid ultrasound dataset (n = 1,638 images; 819 malignant / 819 benign) using 5-fold cross-validation, where the proposed method attains 98.17% accuracy, 98.76% sensitivity, 97.58% specificity, 97.55% F1-score, and an AUC of 0.9912. A controlled ablation shows that adding chaotic modulation to CDF9/7 improves accuracy by +8.79 percentage points over a CDF9/7-only CNN (from 89.38% to 98.17%). To objectively position our approach, we trained state-of-the-art backbones on the same data and splits: EfficientNetV2-S (96.58% accuracy; AUC 0.987), Swin-T (96.41%; 0.986), ViT-B/16 (95.72%; 0.983), and ConvNeXt-T (96.94%; 0.987). Our method outperforms the best of these by +1.23 points in accuracy and +0.0042 in AUC, while remaining computationally efficient (28.7 ms per image; 1,125 MB peak VRAM). Robustness is further supported by cross-dataset testing on TCIA (accuracy 95.82%) and transfer to an ISIC skin-lesion subset (n = 28 unique images, augmented to 2,048; accuracy 97.31%). Explainability analyses (Grad-CAM, SHAP, LIME) highlight clinically relevant regions. Altogether, the wavelet-chaos-CNN pipeline delivers state-of-the-art thyroid ultrasound classification with strong generalization and practical runtime characteristics suitable for clinical integration.
中文: 本研究提出的小波混沌CNN方法在甲状腺超声分类中达到98.17%的准确率,在超越现有模型的同时保持了计算效率和临床可解释性,展现了优异的泛化能力。
English: This study introduces a wavelet-chaos-CNN method that achieves 98.17% accuracy in thyroid ultrasound classification, outperforming existing models while maintaining computational efficiency and clinical interpretability.

Authors:Anyi Wang, Xuansheng Wu, Dong Shu, Yunpu Ma, Ninghao Liu
Title: Enhancing LLM Steering through Sparse Autoencoder-Based Vector Refinement
Abstract:
Steering has emerged as a promising approach in controlling large language models (LLMs) without modifying model parameters. However, most existing steering methods rely on large-scale datasets to learn clear behavioral information, which limits their applicability in many real-world scenarios. The steering vectors extracted from small dataset often contain task-irrelevant noising features, which degrades their effectiveness. To refine the steering vectors learned from limited data, we introduce Refinement of Steering Vector via Sparse Autoencoder (SAE-RSV) that leverages SAEs to semantically denoise and augment the steering vectors. In our framework, we first remove task-irrelevant features according to their semantics provided by SAEs, and then enrich task-relevant features missing from the small dataset through their semantic similarity to the identified relevant features. Extensive experiments demonstrate that the proposed SAE-RSV substantially outperforms all the baseline methods including supervised fine-tuning. Our findings show that effective steering vector can be constructed from limited training data by refining the original steering vector through SAEs.
中文摘要:SAE-RSV方法通过稀疏自编码器对有限数据中提取的引导向量进行语义去噪和特征增强,有效提升了语言模型控制性能并显著优于现有基线方法。
English Summary: The SAE-RSV method refines steering vectors for large language models by using sparse autoencoders to remove noise and enhance relevant features from limited data, significantly improving performance over existing methods.

Authors:Peng Yu, Zeyuan Zhao, Shao Zhang, Luoyi Fu, Xinbing Wang, Ying Wen
Title: Learning to Reason in Structured In-context Environments with Reinforcement Learning
Abstract:
Large language models (LLMs) have achieved significant advancements in reasoning capabilities through reinforcement learning (RL) via environmental exploration. As the intrinsic properties of the environment determine the abilities that LLMs can learn, the environment plays a important role in the RL finetuning process. An ideal LLM reasoning environment should possess three core characteristics: scalability, generalizable reasoning, and verifiability. However, existing mathematical and coding environments are difficult to scale due to heavy reliance on expert annotation, while the skills learned in game-based environments are too specialized to generalize. To bridge this gap, we introduce the \textbf{S}tructured \textbf{I}n-context \textbf{E}nvironment (SIE) framework. SIE achieves scalability by automatically constructing reasoning environments from large-scale structured data, where the rich compositional patterns naturally support generalizable reasoning. Moreover, the explicit schemas and reasoning chains in structured data provide a foundation for rule-based verifiability. Experimental results show that SIE framework not only achieves substantial improvements in in-domain structured reasoning, but also enables the learned compositional reasoning skills to generalize effectively to out-of-domain mathematical and logical reasoning tasks. We further explored learning in information-limited partial SIEs and found that LLMs can infer the missing information through exploring the environment, leading to robust reasoning improvements and generalization performance.
中文: SIE框架通过从大规模结构化数据自动构建可扩展的推理环境,使大语言模型能够学习可泛化的组合推理技能,在领域内外任务中均实现显著性能提升。
English: The SIE framework automatically creates scalable reasoning environments from structured data, enabling LLMs to learn generalizable compositional reasoning skills that improve performance on both in-domain and out-of-domain tasks.

Authors:Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang
Title: Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time
Abstract:
Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.
中文摘要:测试时缩放通过增加推理计算提升大语言模型推理能力,而动态专家搜索利用混合专家模型中可控的专家激活,在不增加成本的情况下提高推理准确性和稳定性。
English Summary: Test-Time Scaling (TTS) improves LLM reasoning through inference computation, and the proposed Dynamic Experts Search (DES) leverages controllable expert activation in Mixture-of-Experts models to enhance reasoning accuracy and stability without extra cost.

Authors:Mohammad Abbadi, Yassine Himeur, Shadi Atalla, Dahlia Mansoor, Wathiq Mansoor
Title: LLM-Augmented and Fair Machine Learning Framework for University Admission Prediction
Abstract:
Universities face surging applications and heightened expectations for fairness, making accurate admission prediction increasingly vital. This work presents a comprehensive framework that fuses machine learning, deep learning, and large language model techniques to combine structured academic and demographic variables with unstructured text signals. Drawing on more than 2,000 student records, the study benchmarks logistic regression, Naive Bayes, random forests, deep neural networks, and a stacked ensemble. Logistic regression offers a strong, interpretable baseline at 89.5% accuracy, while the stacked ensemble achieves the best performance at 91.0%, with Naive Bayes and random forests close behind. To probe text integration, GPT-4-simulated evaluations of personal statements are added as features, yielding modest gains but demonstrating feasibility for authentic essays and recommendation letters. Transparency is ensured through feature-importance visualizations and fairness audits. The audits reveal a 9% gender gap (67% male vs. 76% female) and an 11% gap by parental education, underscoring the need for continued monitoring. The framework is interpretable, fairness-aware, and deployable.
中文摘要:本研究开发了一个可解释且关注公平性的录取预测框架,通过融合机器学习与文本分析技术,在实现91%预测准确率的同时,识别出需要持续监控的人口统计差异。
English Summary: This study develops an interpretable and fairness-aware admission prediction framework that integrates machine learning with text analysis, achieving 91% accuracy while identifying demographic disparities requiring ongoing monitoring.

Authors:Zhe Zhu, Le Wan, Rui Xu, Yiheng Zhang, Honghua Chen, Zhiyang Dou, Cheng Lin, Yuan Liu, Mingqiang Wei
Title: PartSAM: A Scalable Promptable Part Segmentation Model Trained on Native 3D Data
Abstract:
Segmenting 3D objects into parts is a long-standing challenge in computer vision. To overcome taxonomy constraints and generalize to unseen 3D objects, recent works turn to open-world part segmentation. These approaches typically transfer supervision from 2D foundation models, such as SAM, by lifting multi-view masks into 3D. However, this indirect paradigm fails to capture intrinsic geometry, leading to surface-only understanding, uncontrolled decomposition, and limited generalization. We present PartSAM, the first promptable part segmentation model trained natively on large-scale 3D data. Following the design philosophy of SAM, PartSAM employs an encoder-decoder architecture in which a triplane-based dual-branch encoder produces spatially structured tokens for scalable part-aware representation learning. To enable large-scale supervision, we further introduce a model-in-the-loop annotation pipeline that curates over five million 3D shape-part pairs from online assets, providing diverse and fine-grained labels. This combination of scalable architecture and diverse 3D data yields emergent open-world capabilities: with a single prompt, PartSAM achieves highly accurate part identification, and in a Segment-Every-Part mode, it automatically decomposes shapes into both surface and internal structures. Extensive experiments show that PartSAM outperforms state-of-the-art methods by large margins across multiple benchmarks, marking a decisive step toward foundation models for 3D part understanding.
中文:PartSAM是一种创新的可提示三维部件分割模型,通过直接学习大规模三维数据,能够精确识别表面与内部结构,并在多个基准测试中大幅超越现有方法。
English: PartSAM is a novel promptable 3D part segmentation model that directly learns from large-scale 3D data, enabling accurate identification of both surface and internal structures while significantly outperforming existing methods.

Authors:Yun Wang, Zhaojun Ding, Xuansheng Wu, Siyue Sun, Ninghao Liu, Xiaoming Zhai
Title: AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition
Abstract:
Automated scoring plays a crucial role in education by reducing the reliance on human raters, offering scalable and immediate evaluation of student work. While large language models (LLMs) have shown strong potential in this task, their use as end-to-end raters faces challenges such as low accuracy, prompt sensitivity, limited interpretability, and rubric misalignment. These issues hinder the implementation of LLM-based automated scoring in assessment practice. To address the limitations, we propose AutoSCORE, a multi-agent LLM framework enhancing automated scoring via rubric-aligned Structured COmponent REcognition. With two agents, AutoSCORE first extracts rubric-relevant components from student responses and encodes them into a structured representation (i.e., Scoring Rubric Component Extraction Agent), which is then used to assign final scores (i.e., Scoring Agent). This design ensures that model reasoning follows a human-like grading process, enhancing interpretability and robustness. We evaluate AutoSCORE on four benchmark datasets from the ASAP benchmark, using both proprietary and open-source LLMs (GPT-4o, LLaMA-3.1-8B, and LLaMA-3.1-70B). Across diverse tasks and rubrics, AutoSCORE consistently improves scoring accuracy, human-machine agreement (QWK, correlations), and error metrics (MAE, RMSE) compared to single-agent baselines, with particularly strong benefits on complex, multi-dimensional rubrics, and especially large relative gains on smaller LLMs. These results demonstrate that structured component recognition combined with multi-agent design offers a scalable, reliable, and interpretable solution for automated scoring.
中文摘要:AutoSCORE是一种多智能体大语言模型框架,通过先提取学生回答中与评分标准相关的结构化成分再进行评分,有效提升了自动评分的准确性、可解释性和鲁棒性,在不同任务和模型上均展现出稳定改进。
English Summary: AutoSCORE is a multi-agent LLM framework that enhances automated scoring accuracy and interpretability by first extracting rubric-aligned components from student responses before assigning final scores, demonstrating consistent improvements across diverse tasks and models.

Authors:Haodong Zhao, Jidong Li, Zhaomin Wu, Tianjie Ju, Zhuosheng Zhang, Bingsheng He, Gongshen Liu
Title: Disagreements in Reasoning: How a Model's Thinking Process Dictates Persuasion in Multi-Agent Systems
Abstract:
The rapid proliferation of recent Multi-Agent Systems (MAS), where Large Language Models (LLMs) and Large Reasoning Models (LRMs) usually collaborate to solve complex problems, necessitates a deep understanding of the persuasion dynamics that govern their interactions. This paper challenges the prevailing hypothesis that persuasive efficacy is primarily a function of model scale. We propose instead that these dynamics are fundamentally dictated by a model's underlying cognitive process, especially its capacity for explicit reasoning. Through a series of multi-agent persuasion experiments, we uncover a fundamental trade-off we term the Persuasion Duality. Our findings reveal that the reasoning process in LRMs exhibits significantly greater resistance to persuasion, maintaining their initial beliefs more robustly. Conversely, making this reasoning process transparent by sharing the "thinking content" dramatically increases their ability to persuade others. We further consider more complex transmission persuasion situations and reveal complex dynamics of influence propagation and decay within multi-hop persuasion between multiple agent networks. This research provides systematic evidence linking a model's internal processing architecture to its external persuasive behavior, offering a novel explanation for the susceptibility of advanced models and highlighting critical implications for the safety, robustness, and design of future MAS.
中文摘要:本研究挑战了模型规模决定说服力的主流观点,揭示多智能体系统中的说服机制实由内部推理过程主导:透明推理显著增强说服力,而严密推理则大幅提升抗说服能力。
English Summary: This study challenges the view that model size determines persuasive power, revealing instead that persuasion dynamics in multi-agent systems are governed by internal reasoning processes, where transparent reasoning enhances persuasiveness while robust reasoning increases resistance.

Authors:Srishti Gureja, Elena Tommasone, Jingyi He, Sara Hooker, Matthias Gallé, Marzieh Fadaee
Title: Verification Limits Code LLM Training
Abstract:
Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models. While this enables scalable data creation, it introduces a previously unexplored bottleneck: the verification ceiling, in which the quality and diversity of training data are fundamentally constrained by the capabilities of synthetic verifiers. In this work, we systematically study how verification design and strategies influence model performance. We investigate (i) what we verify by analyzing the impact of test complexity and quantity: richer test suites improve code generation capabilities (on average +3 pass@1), while quantity alone yields diminishing returns, (ii) how we verify by exploring relaxed pass thresholds: rigid 100% pass criteria can be overly restrictive. By allowing for relaxed thresholds or incorporating LLM-based soft verification, we can recover valuable training data, leading to a 2-4 point improvement in pass@1 performance. However, this benefit is contingent upon the strength and diversity of the test cases used, and (iii) why verification remains necessary through controlled comparisons of formally correct versus incorrect solutions and human evaluation: retaining diverse correct solutions per problem yields consistent generalization gains. Our results show that Verification as currently practiced is too rigid, filtering out valuable diversity. But it cannot be discarded, only recalibrated. By combining calibrated verification with diverse, challenging problem-solution pairs, we outline a path to break the verification ceiling and unlock stronger code generation models.
中文: 研究表明,当前代码生成模型合成数据中严格的验证方法限制了数据的多样性和质量,提出通过校准验证结合多样化测试用例来突破这一瓶颈,从而提升模型性能。
English: The study reveals that current rigid verification methods in synthetic data creation for code generation models limit data diversity and quality, proposing calibrated verification with diverse test cases to overcome these constraints and enhance model performance.

Authors:Fanchen Bu, Geon Lee, Minyoung Choe, Kijung Shin
Title: Identifying Group Anchors in Real-World Group Interactions Under Label Scarcity
Abstract:
Group interactions occur in various real-world contexts, e.g., co-authorship, email communication, and online Q&A. In each group, there is often a particularly significant member, around whom the group is formed. Examples include the first or last author of a paper, the sender of an email, and the questioner in a Q&A session. In this work, we discuss the existence of such individuals in real-world group interactions. We call such individuals group anchors and study the problem of identifying them. First, we introduce the concept of group anchors and the identification problem. Then, we discuss our observations on group anchors in real-world group interactions. Based on our observations, we develop AnchorRadar, a fast and effective method for group anchor identification under realistic settings with label scarcity, i.e., when only a few groups have known anchors. AnchorRadar is a semi-supervised method using information from groups both with and without known group anchors. Finally, through extensive experiments on thirteen real-world datasets, we demonstrate the empirical superiority of AnchorRadar over various baselines w.r.t. accuracy and efficiency. In most cases, AnchorRadar achieves higher accuracy in group anchor identification than all the baselines, while using 10.2$\times$ less training time than the fastest baseline and 43.6$\times$ fewer learnable parameters than the most lightweight baseline on average.
Chinese: 本研究提出了群体锚点的概念,即围绕其形成群体的关键个体,并开发了AnchorRadar这一半监督方法,即使在标签稀缺的情况下也能高效、准确地识别锚点,且所需资源和时间显著少于现有方法。
English: This study introduces the concept of group anchors—key individuals around whom groups form—and proposes AnchorRadar, a semi-supervised method that efficiently identifies them with high accuracy and minimal resources, even with limited labeled data.

Authors:Fanchen Bu, Geon Lee, Minyoung Choe, Kijung Shin
Title: Identifying Group Anchors in Real-World Group Interactions Under Label Scarcity
Abstract:
Group interactions occur in various real-world contexts, e.g., co-authorship, email communication, and online Q&A. In each group, there is often a particularly significant member, around whom the group is formed. Examples include the first or last author of a paper, the sender of an email, and the questioner in a Q&A session. In this work, we discuss the existence of such individuals in real-world group interactions. We call such individuals group anchors and study the problem of identifying them. First, we introduce the concept of group anchors and the identification problem. Then, we discuss our observations on group anchors in real-world group interactions. Based on our observations, we develop AnchorRadar, a fast and effective method for group anchor identification under realistic settings with label scarcity, i.e., when only a few groups have known anchors. AnchorRadar is a semi-supervised method using information from groups both with and without known group anchors. Finally, through extensive experiments on thirteen real-world datasets, we demonstrate the empirical superiority of AnchorRadar over various baselines w.r.t. accuracy and efficiency. In most cases, AnchorRadar achieves higher accuracy in group anchor identification than all the baselines, while using 10.2$\times$ less training time than the fastest baseline and 43.6$\times$ fewer learnable parameters than the most lightweight baseline on average.
Chinese: 本研究提出了群体锚点的概念,即围绕其形成群体的关键个体,并开发了AnchorRadar这一半监督方法,即使在标签稀缺的情况下也能高效、准确地识别锚点,且所需资源和时间显著少于现有方法。
English: This study introduces the concept of group anchors—key individuals around whom groups form—and proposes AnchorRadar, a semi-supervised method that efficiently identifies them with high accuracy and minimal resources, even with limited labeled data.

Authors:Minoo Dolatabadi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Title: Towards Robust LiDAR Localization: Deep Learning-based Uncertainty Estimation
Abstract:
LiDAR-based localization and SLAM often rely on iterative matching algorithms, particularly the Iterative Closest Point (ICP) algorithm, to align sensor data with pre-existing maps or previous scans. However, ICP is prone to errors in featureless environments and dynamic scenes, leading to inaccurate pose estimation. Accurately predicting the uncertainty associated with ICP is crucial for robust state estimation but remains challenging, as existing approaches often rely on handcrafted models or simplified assumptions. Moreover, a few deep learning-based methods for localizability estimation either depend on a pre-built map, which may not always be available, or provide a binary classification of localizable versus non-localizable, which fails to properly model uncertainty. In this work, we propose a data-driven framework that leverages deep learning to estimate the registration error covariance of ICP before matching, even in the absence of a reference map. By associating each LiDAR scan with a reliable 6-DoF error covariance estimate, our method enables seamless integration of ICP within Kalman filtering, enhancing localization accuracy and robustness. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our approach, showing that it accurately predicts covariance and, when applied to localization using a pre-built map or SLAM, reduces localization errors and improves robustness.
中文摘要:该研究提出的深度学习框架可在无先验地图情况下预先预测ICP配准误差协方差,通过卡尔曼滤波提升定位精度与鲁棒性,KITTI数据集实验验证了其有效性。
English Summary: The proposed deep learning framework predicts the registration error covariance of ICP prior to matching, enabling enhanced localization accuracy and robustness in Kalman filtering without requiring pre-existing maps.

Authors:Haodong Zhao, Chenyan Zhao, Yansi Li, Zhuosheng Zhang, Gongshen Liu
Title: Thinking in a Crowd: How Auxiliary Information Shapes LLM Reasoning
Abstract:
The capacity of Large Language Models (LLMs) to reason is fundamental to their application in complex, knowledge-intensive domains. In real-world scenarios, LLMs are often augmented with external information that can be helpful, irrelevant, or even misleading. This paper investigates the causal impact of such auxiliary information on the reasoning process of LLMs with explicit step-by-step thinking capabilities. We introduce SciAux, a new dataset derived from ScienceQA, to systematically test the robustness of the model against these types of information. Our findings reveal a critical vulnerability: the model's deliberative "thinking mode" is a double-edged sword. While helpful context improves accuracy, misleading information causes a catastrophic drop in performance, which is amplified by the thinking process. Instead of conferring robustness, thinking reinforces the degree of error when provided with misinformation. This highlights that the challenge is not merely to make models "think", but to endow them with the critical faculty to evaluate the information upon which their reasoning is based. The SciAux dataset is available at https://huggingface.co/datasets/billhdzhao/SciAux.
Chinese: 研究表明,大型语言模型的逐步推理能力在有益信息下可提升准确性,但面对误导性信息时会灾难性地放大错误,凸显了人工智能必须具备批判性信息评估能力。
English: The study reveals that while large language models' step-by-step reasoning can enhance accuracy with helpful context, it catastrophically amplifies errors when exposed to misleading information, underscoring the need for critical evaluation skills in AI.

Authors:Shuyao Shang, Yuntao Chen, Yuqi Wang, Yingyan Li, Zhaoxiang Zhang
Title: DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving
Abstract:
End-to-end autonomous driving has substantially progressed by directly predicting future trajectories from raw perception inputs, which bypasses traditional modular pipelines. However, mainstream methods trained via imitation learning suffer from critical safety limitations, as they fail to distinguish between trajectories that appear human-like but are potentially unsafe. Some recent approaches attempt to address this by regressing multiple rule-driven scores but decoupling supervision from policy optimization, resulting in suboptimal performance. To tackle these challenges, we propose DriveDPO, a Safety Direct Preference Optimization Policy Learning framework. First, we distill a unified policy distribution from human imitation similarity and rule-based safety scores for direct policy optimization. Further, we introduce an iterative Direct Preference Optimization stage formulated as trajectory-level preference alignment. Extensive experiments on the NAVSIM benchmark demonstrate that DriveDPO achieves a new state-of-the-art PDMS of 90.0. Furthermore, qualitative results across diverse challenging scenarios highlight DriveDPO's ability to produce safer and more reliable driving behaviors.
中文:DriveDPO提出了一种以安全为核心的框架,通过直接偏好优化整合人类模仿与安全评分来优化自动驾驶策略,在多样化场景中实现了顶尖性能并显著提升了驾驶可靠性。
English: DriveDPO introduces a safety-focused framework that optimizes autonomous driving policies by integrating human imitation and safety scores through direct preference optimization, achieving state-of-the-art performance and enhanced reliability in diverse scenarios.

Authors:Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Title: Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Abstract:
Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
中文: 该研究提出无需训练的推理时令牌淘汰策略,通过保留关键信息令牌来限制内存增长,在多种3D感知任务中大幅降低内存使用的同时保持了精度。
English: The proposed training-free token eviction policy effectively bounds KV memory growth in streaming visual transformers by discarding redundant tokens, maintaining high accuracy while drastically reducing memory usage across various 3D perception tasks.

Authors:Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Title: Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Abstract:
Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
中文: 该研究提出无需训练的推理时令牌淘汰策略,通过保留关键信息令牌来限制内存增长,在多种3D感知任务中大幅降低内存使用的同时保持了精度。
English: The proposed training-free token eviction policy effectively bounds KV memory growth in streaming visual transformers by discarding redundant tokens, maintaining high accuracy while drastically reducing memory usage across various 3D perception tasks.

Authors:Yujie Feng, Jian Li, Xiaoyu Dong, Pengfei Xu, Xiaohui Zhou, Yujia Zhang, Zexin LU, Yasha Wang, Alan Zhao, Xu Chu, Xiao-Ming Wu
Title: AIMMerging: Adaptive Iterative Model Merging Using Training Trajectories for Language Model Continual Learning
Abstract:
Continual learning (CL) is essential for deploying large language models (LLMs) in dynamic real-world environments without the need for costly retraining. Recent model merging-based methods have attracted significant attention, but they still struggle to effectively manage the trade-off between learning new knowledge and preventing forgetting, a challenge largely stemming from suboptimal number of merges and merging frequency. In this paper, we introduce Adaptive Iterative Model Merging (AimMerging), a novel CL framework that utilizes learning and forgetting signals from the training trajectory to dynamically monitor the model's training status. Guided by dynamic monitoring, the training trajectory-guided merge controller adaptively determines the timing and frequency of iterative fusion, while the rehearsal-based knowledge fusion module computes the merging weights and executes the fusion. Comprehensive experiments on three CL benchmarks with various model sizes (from 770M to 13B) demonstrate that AimMerging achieves significant performance improvements over existing state-of-the-art methods, with an average relative improvement of 80% and 59% on FWT and BWT, respectively. The source code is provided for reproducibility.
中文:AimMerging是一种新颖的持续学习框架,通过训练轨迹信号动态调整模型融合时机与频率,在FWT和BWT指标上分别实现了80%和59%的相对性能提升,达到最先进水平。
English: AimMerging is a novel continual learning framework that dynamically adjusts model merging timing and frequency using training trajectory signals, achieving state-of-the-art performance with 80% and 59% relative improvements on FWT and BWT metrics respectively.

Authors:Hongxin Li, Jingran Su, Jingfan Chen, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
Title: UIPro: Unleashing Superior Interaction Capability For GUI Agents
Abstract:
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, the limited scenario, insufficient size, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes \textbf{UIPro}, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability, which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro's superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach.
中文: 本文提出UIPro通用图形界面代理,通过整合多平台数据和统一动作空间解决现有方法的局限性,在各类图形界面任务基准测试中均表现出卓越性能。
English: This paper introduces UIPro, a generalist GUI agent trained on extensive multi-platform data with a unified action space to overcome limitations in existing methods, achieving superior performance across various GUI task benchmarks.

Authors:Vatsal Malaviya, Agneet Chatterjee, Maitreya Patel, Yezhou Yang, Chitta Baral
Title: AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models
Abstract:
Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, leading to generating images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.
中文: 当前文本到图像模型在生成以动作为核心的场景时因缺乏对细微属性的充分表征而表现不佳,但我们提出的无需训练方法利用大语言模型增强提示的时间细节,使生成准确率提升了72%。
English: Current Text-to-Image models struggle with accurately generating action-centric scenes due to insufficient representation of nuanced attributes, but our proposed training-free method using Large Language Models to enrich prompts with temporal details achieves a 72% improvement in generation accuracy.

Authors:Jisoo Lee, Michael R. Harowicz, Yuwen Chen, Hanxue Gu, Isaac S. Alderete, Lin Li, Maciej A. Mazurowski, Matthew G. Hartwig
Title: Transplant-Ready? Evaluating AI Lung Segmentation Models in Candidates with Severe Lung Disease
Abstract:
This study evaluates publicly available deep-learning based lung segmentation models in transplant-eligible patients to determine their performance across disease severity levels, pathology categories, and lung sides, and to identify limitations impacting their use in preoperative planning in lung transplantation. This retrospective study included 32 patients who underwent chest CT scans at Duke University Health System between 2017 and 2019 (total of 3,645 2D axial slices). Patients with standard axial CT scans were selected based on the presence of two or more lung pathologies of varying severity. Lung segmentation was performed using three previously developed deep learning models: Unet-R231, TotalSegmentator, MedSAM. Performance was assessed using quantitative metrics (volumetric similarity, Dice similarity coefficient, Hausdorff distance) and a qualitative measure (four-point clinical acceptability scale). Unet-R231 consistently outperformed TotalSegmentator and MedSAM in general, for different severity levels, and pathology categories (p<0.05). All models showed significant performance declines from mild to moderate-to-severe cases, particularly in volumetric similarity (p<0.05), without significant differences among lung sides or pathology types. Unet-R231 provided the most accurate automated lung segmentation among evaluated models with TotalSegmentator being a close second, though their performance declined significantly in moderate-to-severe cases, emphasizing the need for specialized model fine-tuning in severe pathology contexts.
中文: 本研究发现Unet-R231模型在不同疾病严重程度和病理类型中均优于其他深度学习肺部分割模型,但所有模型在中重度病例中性能均显著下降,强调了在严重病理情况下需要进行专门模型优化的必要性。
English: This study found that Unet-R231 outperformed other deep learning models in lung segmentation across various disease severities and pathologies, though all models showed significant performance declines in moderate-to-severe cases, highlighting the need for specialized fine-tuning in severe clinical contexts.

Authors:Jiaxing Miao, Liang Hu, Qi Zhang, Lai Zhong Yuan, Usman Naseem
Title: CUFG: Curriculum Unlearning Guided by the Forgetting Gradient
Abstract:
As privacy and security take center stage in AI, machine unlearning, the ability to erase specific knowledge from models, has garnered increasing attention. However, existing methods overly prioritize efficiency and aggressive forgetting, which introduces notable limitations. In particular, radical interventions like gradient ascent, influence functions, and random label noise can destabilize model weights, leading to collapse and reduced reliability. To address this, we propose CUFG (Curriculum Unlearning via Forgetting Gradients), a novel framework that enhances the stability of approximate unlearning through innovations in both forgetting mechanisms and data scheduling strategies. Specifically, CUFG integrates a new gradient corrector guided by forgetting gradients for fine-tuning-based unlearning and a curriculum unlearning paradigm that progressively forgets from easy to hard. These innovations narrow the gap with the gold-standard Retrain method by enabling more stable and progressive unlearning, thereby improving both effectiveness and reliability. Furthermore, we believe that the concept of curriculum unlearning has substantial research potential and offers forward-looking insights for the development of the MU field. Extensive experiments across various forgetting scenarios validate the rationale and effectiveness of our approach and CUFG. Codes are available at https://anonymous.4open.science/r/CUFG-6375.
中文摘要:CUFG框架通过遗忘梯度引导的课程遗忘机制,实现了稳定渐进的知识消除,在避免激进遗忘导致模型崩溃的同时,显著缩小了与黄金标准重训练方法的性能差距。
English Summary: The CUFG framework introduces curriculum unlearning with forgetting gradients to enable stable, progressive knowledge removal, bridging the performance gap with retraining while preventing model collapse from aggressive forgetting methods.

Authors:Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang
Title: Overview of the TREC 2024 NeuCLIR Track
Abstract:
The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR) track is to study the effect of neural approaches on cross-language information access. The track has created test collections containing Chinese, Persian, and Russian news stories and Chinese academic abstracts. NeuCLIR includes four task types: Cross-Language Information Retrieval (CLIR) from news, Multilingual Information Retrieval (MLIR) from news, Report Generation from news, and CLIR from technical documents. A total of 274 runs were submitted by five participating teams (and as baselines by the track coordinators) for eight tasks across these four task types. Task descriptions and the available results are presented.
中文: TREC NeuCLIR 项目旨在研究神经方法对跨语言信息检索的影响,通过构建中文、波斯语和俄语测试集,并在四大任务类型中评估了多支参赛队伍的提交结果。
English: The TREC NeuCLIR track investigates the impact of neural methods on cross-language information retrieval by creating test collections in Chinese, Persian, and Russian, and evaluating performance across four distinct task types with submissions from multiple teams.

Authors:Hasan Abed Al Kader Hammoud, Mohammad Zbeeb, Bernard Ghanem
Title: Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Abstract:
We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR$\leftrightarrow$EN teacher to FP8 (yielding $\sim$2$\times$ higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the "nano" ($\leq$2B) and "small" (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
中文: Hala模型系列通过翻译调优流程开发了以阿拉伯语为核心的指令与翻译功能,在阿拉伯语基准测试中取得领先性能,并开源相关资源以推动阿拉伯语自然语言处理研究。
English: The Hala model family introduces Arabic-centric instruction and translation capabilities through a translate-and-tune pipeline, achieving state-of-the-art results on Arabic benchmarks while releasing resources to advance Arabic NLP research.

Authors:Dayeon Ki, Marine Carpuat, Paul McNamee, Daniel Khashabi, Eugene Yang, Dawn Lawrie, Kevin Duh
Title: Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG
Abstract:
Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open questions is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade-off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.
中文: 多语言检索增强生成系统研究表明,语言模型在引用时倾向于优先选择英文来源,尤其对资源较少的语言和位于语境中段的文档,这种偏好有时甚至超过文档相关性,影响引用的客观性。
English: mRAG systems reveal that language models exhibit a bias toward citing English sources, particularly for lower-resource languages and mid-context documents, sometimes prioritizing language preference over document relevance in citation decisions.

Authors:Dayeon Ki, Marine Carpuat, Paul McNamee, Daniel Khashabi, Eugene Yang, Dawn Lawrie, Kevin Duh
Title: Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG
Abstract:
Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open questions is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade-off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.
中文: 多语言检索增强生成系统研究表明,语言模型在引用时倾向于优先选择英文来源,尤其对资源较少的语言和位于语境中段的文档,这种偏好有时甚至超过文档相关性,影响引用的客观性。
English: mRAG systems reveal that language models exhibit a bias toward citing English sources, particularly for lower-resource languages and mid-context documents, sometimes prioritizing language preference over document relevance in citation decisions.

Authors:Bingshen Mu, Pengcheng Guo, Zhaokai Sun, Shuai Wang, Hexin Liu, Mingchen Shao, Lei Xie, Eng Siong Chng, Longshuai Xiao, Qiangze Feng, Daliang Wang
Title: Summary on The Multilingual Conversational Speech Language Model Challenge: Datasets, Tasks, Baselines, and Methods
Abstract:
This paper summarizes the Interspeech2025 Multilingual Conversational Speech Language Model (MLC-SLM) challenge, which aims to advance the exploration of building effective multilingual conversational speech LLMs (SLLMs). We provide a detailed description of the task settings for the MLC-SLM challenge, the released real-world multilingual conversational speech dataset totaling approximately 1,604 hours, and the baseline systems for participants. The MLC-SLM challenge attracts 78 teams from 13 countries to participate, with 489 valid leaderboard results and 14 technical reports for the two tasks. We distill valuable insights on building multilingual conversational SLLMs based on submissions from participants, aiming to contribute to the advancement of the community.
中文: Interspeech2025多语言对话语音语言模型挑战赛通过提供任务设置、1604小时数据集和基线系统,吸引了78支团队参与,其提交成果为构建多语言对话语音大模型提炼了重要经验。
English: The Interspeech2025 MLC-SLM challenge advances multilingual conversational speech language models by providing task specifications, a 1,604-hour dataset, and baseline systems, attracting 78 teams whose submissions yield valuable insights for the field.

Authors:Zhang Xueyao, Yang Bo, Yu Zhiwen, Cao Xuelin, George C. Alexandropoulos, Merouane Debbah, Chau Yuen
Title: Cooperative Target Detection with AUVs: A Dual-Timescale Hierarchical MARDL Approach
Abstract:
Autonomous Underwater Vehicles (AUVs) have shown great potential for cooperative detection and reconnaissance. However, collaborative AUV communications introduce risks of exposure. In adversarial environments, achieving efficient collaboration while ensuring covert operations becomes a key challenge for underwater cooperative missions. In this paper, we propose a novel dual time-scale Hierarchical Multi-Agent Proximal Policy Optimization (H-MAPPO) framework. The high-level component determines the individuals participating in the task based on a central AUV, while the low-level component reduces exposure probabilities through power and trajectory control by the participating AUVs. Simulation results show that the proposed framework achieves rapid convergence, outperforms benchmark algorithms in terms of performance, and maximizes long-term cooperative efficiency while ensuring covert operations.
中文: 本文提出了一种双时间尺度的H-MAPPO框架,通过中央AUV协调任务参与及个体AUV控制功率与轨迹来降低暴露风险,仿真表明该框架能确保隐蔽性同时实现高效协同作业。
English: The paper introduces a dual time-scale H-MAPPO framework that enables efficient and covert collaboration among AUVs by managing task participation and minimizing exposure through power and trajectory control, demonstrating superior performance in simulations.

Authors:Kento Murata, Shoichi Hasegawa, Tomochika Ishikawa, Yoshinobu Hagiwara, Akira Taniguchi, Lotfi El Hafi, Tadahiro Taniguchi
Title: Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models
Abstract:
It is crucial to efficiently execute instructions such as "Find an apple and a banana" or "Get ready for a field trip," which require searching for multiple objects or understanding context-dependent commands. This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge-specifically, spatial concepts learned from the area designated to it by the user. We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots. We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random (28/50) and commonsense-based assignment (26/50). Furthermore, we conducted qualitative evaluations using two actual mobile manipulators. The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as "Get ready for a field trip," by successfully performing task decomposition, assignment, sequential planning, and execution.
中文摘要:本研究提出了一种利用大语言模型和空间概念的任务规划框架,可将自然语言指令分解为子任务并分配给具有不同情境知识的多个机器人,在移动机械臂实验中展现出优越性能。
English Summary: This study introduces a task planning framework using large language models and spatial concepts to decompose natural language instructions into subtasks and assign them to multiple robots with different situational knowledge, achieving superior performance in experiments with mobile manipulators.

Authors:Kento Murata, Shoichi Hasegawa, Tomochika Ishikawa, Yoshinobu Hagiwara, Akira Taniguchi, Lotfi El Hafi, Tadahiro Taniguchi
Title: Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language Models
Abstract:
It is crucial to efficiently execute instructions such as "Find an apple and a banana" or "Get ready for a field trip," which require searching for multiple objects or understanding context-dependent commands. This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge-specifically, spatial concepts learned from the area designated to it by the user. We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots. We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random (28/50) and commonsense-based assignment (26/50). Furthermore, we conducted qualitative evaluations using two actual mobile manipulators. The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as "Get ready for a field trip," by successfully performing task decomposition, assignment, sequential planning, and execution.
中文摘要:本研究提出了一种利用大语言模型和空间概念的任务规划框架,可将自然语言指令分解为子任务并分配给具有不同情境知识的多个机器人,在移动机械臂实验中展现出优越性能。
English Summary: This study introduces a task planning framework using large language models and spatial concepts to decompose natural language instructions into subtasks and assign them to multiple robots with different situational knowledge, achieving superior performance in experiments with mobile manipulators.

Authors:Saki Hashimoto, Shoichi Hasegawa, Tomochika Ishikawa, Akira Taniguchi, Yoshinobu Hagiwara, Lotfi El Hafi, Tadahiro Taniguchi
Title: Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative Model
Abstract:
Robots operating in domestic and office environments must understand object ownership to correctly execute instructions such as ``Bring me my cup.'' However, ownership cannot be reliably inferred from visual features alone. To address this gap, we propose Active Ownership Learning (ActOwL), a framework that enables robots to actively generate and ask ownership-related questions to users. ActOwL employs a probabilistic generative model to select questions that maximize information gain, thereby acquiring ownership knowledge efficiently to improve learning efficiency. Additionally, by leveraging commonsense knowledge from Large Language Models (LLM), objects are pre-classified as either shared or owned, and only owned objects are targeted for questioning. Through experiments in a simulated home environment and a real-world laboratory setting, ActOwL achieved significantly higher ownership clustering accuracy with fewer questions than baseline methods. These findings demonstrate the effectiveness of combining active inference with LLM-guided commonsense reasoning, advancing the capability of robots to acquire ownership knowledge for practical and socially appropriate task execution.
中文:ActOwL框架通过概率模型主动生成所有权问题,并利用大语言模型的常识推理预筛选对象,在模拟和真实环境中以更少提问显著提升了所有权识别准确率。
English: To help robots learn object ownership efficiently, the ActOwL framework actively generates targeted questions using a probabilistic model and LLM-based commonsense reasoning, achieving higher accuracy with fewer queries in both simulated and real-world tests.

Authors:Hang Guo, Yawei Li, Luca Benini
Title: Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
Abstract:
Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To attack this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization by error compensation between both. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.
Chinese Summary: 本研究提出最优大脑恢复(OBR)框架,通过误差补偿协调量化与稀疏化在权重分布上的冲突,无需训练即可实现大语言模型的强力压缩,相比基线获得4.72倍加速和6.4倍内存缩减。
English Summary: This study introduces Optimal Brain Restoration (OBR), a training-free framework that synergizes quantization and sparsity by compensating for their conflicting weight distribution requirements, achieving significant compression with up to 4.72x speedup and 6.4x memory reduction in LLMs.

Authors:Liqian Feng, Lintao Wang, Kun Hu, Dehui Kong, Zhiyong Wang
Title: Text2Sign Diffusion: A Generative Approach for Gloss-Free Sign Language Production
Abstract:
Sign language production (SLP) aims to translate spoken language sentences into a sequence of pose frames in a sign language, bridging the communication gap and promoting digital inclusion for deaf and hard-of-hearing communities. Existing methods typically rely on gloss, a symbolic representation of sign language words or phrases that serves as an intermediate step in SLP. This limits the flexibility and generalization of SLP, as gloss annotations are often unavailable and language-specific. Therefore, we present a novel diffusion-based generative approach - Text2Sign Diffusion (Text2SignDiff) for gloss-free SLP. Specifically, a gloss-free latent diffusion model is proposed to generate sign language sequences from noisy latent sign codes and spoken text jointly, reducing the potential error accumulation through a non-autoregressive iterative denoising process. We also design a cross-modal signing aligner that learns a shared latent space to bridge visual and textual content in sign and spoken languages. This alignment supports the conditioned diffusion-based process, enabling more accurate and contextually relevant sign language generation without gloss. Extensive experiments on the commonly used PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method, achieving the state-of-the-art performance.
中文: 本文提出Text2SignDiff方法,采用基于扩散的生成模型直接从口语生成手语序列,无需依赖中间符号标注,在标准数据集上实现了最优性能。
English: This paper introduces Text2SignDiff, a gloss-free sign language production method that uses a diffusion-based generative model to directly translate spoken language into sign sequences, eliminating the need for intermediate gloss annotations and achieving state-of-the-art performance on benchmark datasets.

Authors:Shulai Zhang, Ao Xu, Quan Chen, Han Zhao, Weihao Cui, Ningxin Zheng, Haibin Lin, Xin Liu, Minyi Guo
Title: Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution
Abstract:
Embodied AI systems operate in dynamic environments, requiring seamless integration of perception and generation modules to process high-frequency input and output demands. Traditional sequential computation patterns, while effective in ensuring accuracy, face significant limitations in achieving the necessary "thinking" frequency for real-world applications. In this work, we present Auras, an algorithm-system co-designed inference framework to optimize the inference frequency of embodied AI agents. Auras disaggregates the perception and generation and provides controlled pipeline parallelism for them to achieve high and stable throughput. Faced with the data staleness problem that appears when the parallelism is increased, Auras establishes a public context for perception and generation to share, thereby promising the accuracy of embodied agents. Experimental results show that Auras improves throughput by 2.54x on average while achieving 102.7% of the original accuracy, demonstrating its efficacy in overcoming the constraints of sequential computation and providing high throughput.
中文: Auras是一种算法与系统协同设计的推理框架,通过解耦感知与生成模块并采用可控流水线并行技术,将嵌入式AI代理的吞吐量平均提升2.54倍,同时保持原有准确率。
English: Auras is a co-designed algorithm-system framework that enhances embodied AI agents' throughput by disaggregating perception and generation modules with controlled pipeline parallelism, achieving 2.54x higher throughput while maintaining original accuracy.

Authors:Junjie Ni, Tong Wu, Zhiyong Chen, Yin Xu, Meixia Tao, Wenjun Zhang
Title: Mixture of Semantics Transmission for Generative AI-Enabled Semantic Communication Systems
Abstract:
In this paper, we propose a mixture of semantics (MoS) transmission strategy for wireless semantic communication systems based on generative artificial intelligence (AI). At the transmitter, we divide an image into regions of interest (ROI) and reigons of non-interest (RONI) to extract their semantic information respectively. Semantic information of ROI can be allocated more bandwidth, while RONI can be represented in a compact form for transmission. At the receiver, a diffusion model reconstructs the full image using the received semantic information of ROI and RONI. Compared to existing generative AI-based methods, MoS enables more efficient use of channel resources by balancing visual fidelity and semantic relevance. Experimental results demonstrate that appropriate ROI-RONI allocation is critical. The MoS achieves notable performance gains in peak signal-to-noise ratio (PSNR) of ROI and CLIP score of RONI.
中文: 本文提出了一种基于生成式人工智能的无线语义通信混合语义传输策略,通过对图像感兴趣与非感兴趣区域分别优化带宽分配,在提升视觉保真度和语义相关性的同时实现了更高效的资源利用。
English: This paper introduces a Mixture of Semantics (MoS) transmission strategy for wireless semantic communication, which optimizes bandwidth allocation between regions of interest and non-interest in images using generative AI, enhancing both visual fidelity and semantic relevance.

Authors:Hao Si, Ehsan Javanmardi, Manabu Tsukada
Title: You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception
Abstract:
Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.
中文: 渐进式异构协同感知(PHCP)框架通过推理过程中动态对齐特征,无需联合训练或标注数据,有效提升了协同感知能力,在多种异构场景下表现出色。
English: Collaborative perception is enhanced by the Progressive Heterogeneous Collaborative Perception (PHCP) framework, which dynamically aligns features during inference without requiring joint training or labeled data, achieving strong performance in diverse heterogeneous scenarios.

Authors:Amir Ivry, Samuele Cornell, Shinji Watanabe
Title: MAPSS: Manifold-based Assessment of Perceptual Source Separation
Abstract:
Objective assessment of source-separation systems still mismatches subjective human perception, especially when leakage and self-distortion interact. We introduce the Perceptual Separation (PS) and Perceptual Match (PM), the first pair of measures that functionally isolate these two factors. Our intrusive method begins with generating a bank of fundamental distortions for each reference waveform signal in the mixture. Distortions, references, and their respective system outputs from all sources are then independently encoded by a pre-trained self-supervised learning model. These representations are aggregated and projected onto a manifold via diffusion maps, which aligns Euclidean distances on the manifold with dissimilarities of the encoded waveforms. On this manifold, the PM measures the Mahalanobis distance from each output to its attributed cluster that consists of its reference and distortions embeddings, capturing self-distortion. The PS accounts for the Mahalanobis distance of the output to the attributed and to the closest non-attributed clusters, quantifying leakage. Both measures are differentiable and granular, operating at a resolution as low as 50 frames per second. We further derive, for both measures, deterministic error radius and non-asymptotic, high-probability confidence intervals (CIs). Experiments on English, Spanish, and music mixtures show that the PS and PM nearly always achieve the highest linear correlation coefficients with human mean-opinion scores than 14 competitors, reaching as high as 86.36% for speech and 87.21% for music. We observe, at worst, an error radius of 1.39% and a probabilistic 95% CI of 12.21% for these coefficients, which improves reliable and informed evaluation. Using mutual information, the measures complement each other most as their values decrease, suggesting they are jointly more informative as system performance degrades.
中文: 本研究提出了感知分离(PS)和感知匹配(PM),这是首个能功能性地分离源分离系统中泄漏和自失真的测量方法,实现了与人类感知的最高相关性,并提供具有稳健置信区间的细粒度、可微分评估。
English: The study introduces Perceptual Separation (PS) and Perceptual Match (PM), the first measures to functionally isolate leakage and self-distortion in source-separation systems, achieving the highest correlation with human perception and offering granular, differentiable evaluation with robust confidence intervals.

Authors:Farhad Nawaz, Faizan M. Tariq, Sangjae Bae, David Isele, Avinash Singh, Nadia Figueroa, Nikolai Matni, Jovin D'sa
Title: Occupancy-aware Trajectory Planning for Autonomous Valet Parking in Uncertain Dynamic Environments
Abstract:
Autonomous Valet Parking (AVP) requires planning under partial observability, where parking spot availability evolves as dynamic agents enter and exit spots. Existing approaches either rely only on instantaneous spot availability or make static assumptions, thereby limiting foresight and adaptability. We propose an approach that estimates probability of future spot occupancy by distinguishing initially vacant and occupied spots while leveraging nearby dynamic agent motion. We propose a probabilistic estimator that integrates partial, noisy observations from a limited Field-of-View, with the evolving uncertainty of unobserved spots. Coupled with the estimator, we design a strategy planner that balances goal-directed parking maneuvers with exploratory navigation based on information gain, and incorporates wait-and-go behaviors at promising spots. Through randomized simulations emulating large parking lots, we demonstrate that our framework significantly improves parking efficiency and trajectory smoothness over existing approaches, while maintaining safety margins.
中文摘要:本文提出了一种自主代客泊车的概率框架,通过结合局部观测与动态车辆移动来预测未来车位占用情况,采用平衡泊车操作与探索导航的策略规划,显著提升了泊车效率和轨迹平滑度。
English Summary: This paper introduces a probabilistic framework for Autonomous Valet Parking that estimates future parking spot occupancy by combining partial observations with dynamic agent movements, enabling strategic planning that balances parking maneuvers with exploratory navigation to significantly improve efficiency and trajectory smoothness.

Authors:Xusheng Zhu, Kai-Kit Wong, Hao Xu, Han Xiao, Hanjiang Hong, Hyundong Shin, Yangyang Zhang
Title: Fluid Antenna Systems: A Geometric Approach to Error Probability and Fundamental Limits
Abstract:
The fluid antenna system (FAS) concept is an emerging paradigm that promotes the utilization of the feature of shape and position reconfigurability in antennas to broaden the design of wireless communication systems. This also means that spatial diversity can be exploited in an unconventional way. However, a rigorous framework for error probability analysis of FAS under realistic spatially correlated channels has been lacking. In this paper, we fill this gap by deriving a tight, closed-form asymptotic expression for the symbol error rate (SER) that establishes the fundamental scaling law linking the system's SER to the channel's spatial correlation structure. A key insight of our analysis is that the achievable diversity gain is governed not by the number of antenna ports, but by the channel's effective rank. To find this critical parameter, we propose a novel dual-pronged approach. First of all, we develop a geometry-based algorithm that extracts distinct performance thresholds from the channel's eigenvalue spectrum. Second, we theoretically prove that the effective rank converges to a fundamental limit dictated solely by the antenna's normalized aperture width. We further establish the equivalence between the threshold identified by the geometric algorithm and the derived theoretical limit, providing rigorous validation for the proposed method. Our effective rank model achieves higher accuracy than existing approaches in the literature. Building on this framework, we offer a complete characterization of diversity and coding gains. The analysis leads to a definitive design insight: FAS performance improvements are fundamentally driven by enlarging the antenna's explorable aperture, which increases the effective channel rank, whereas increasing port density within a fixed aperture yields diminishing returns.
中文: 本文提出了一个严谨的分析框架,用于研究空间相关信道下流体天线系统的误码率,揭示其分集增益取决于由天线孔径宽度决定的信道有效秩,而非天线端口数量。
English: This paper introduces a rigorous framework for analyzing the symbol error rate of fluid antenna systems under spatially correlated channels, revealing that diversity gain depends on the channel's effective rank, which is determined by the antenna's aperture width rather than port count.

Authors:Marc Marone, Orion Weller, William Fleshman, Eugene Yang, Dawn Lawrie, Benjamin Van Durme
Title: mmBERT: A Modern Multilingual Encoder with Annealed Language Learning
Abstract:
Encoder-only languages models are frequently used for a variety of standard machine learning tasks, including classification and retrieval. However, there has been a lack of recent research for encoder models, especially with respect to multilingual models. We introduce mmBERT, an encoder-only language model pretrained on 3T tokens of multilingual text in over 1800 languages. To build mmBERT we introduce several novel elements, including an inverse mask ratio schedule and an inverse temperature sampling ratio. We add over 1700 low-resource languages to the data mix only during the decay phase, showing that it boosts performance dramatically and maximizes the gains from the relatively small amount of training data. Despite only including these low-resource languages in the short decay phase we achieve similar classification performance to models like OpenAI's o3 and Google's Gemini 2.5 Pro. Overall, we show that mmBERT significantly outperforms the previous generation of models on classification and retrieval tasks -- on both high and low-resource languages.
中文:mmBERT是一种创新的仅编码器多语言模型,通过在1800多种语言的大规模数据上预训练,并采用逆掩码比率调度和衰减阶段选择性纳入低资源语言等新技术,在分类和检索任务上实现了与顶尖模型相媲美的卓越性能。
English: mmBERT is a novel encoder-only multilingual model pretrained on extensive data across 1800+ languages, introducing innovative techniques like inverse mask ratio scheduling and selective low-resource language inclusion during decay phases to achieve superior classification and retrieval performance comparable to leading models.

Authors:George Ciubotariu, Florin-Alexandru Vasluianu, Zhuyun Zhou, Nancy Mehta, Radu Timofte, Ke Wu, Long Sun, Lingshun Kong, Zhongbao Yang, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Hao Chen, Yinghui Fang, Dafeng Zhang, Yongqi Song, Jiangbo Guo, Shuhua Jin, Zeyu Xiao, Rui Zhao, Zhuoyuan Li, Cong Zhang, Yufeng Peng, Xin Lu, Zhijing Sun, Chengjie Ge, Zihao Li, Zishun Liao, Ziang Zhou, Qiyu Kang, Xueyang Fu, Zheng-Jun Zha, Yuqian Zhang, Shuai Liu, Jie Liu, Zhuhao Zhang, Lishen Qu, Zhihao Liu, Shihao Zhou, Yaqi Luo, Juncheng Zhou, Jufeng Yang, Qianfeng Yang, Qiyuan Guan, Xiang Chen, Guiyue Jin, Jiyu Jin
Title: AIM 2025 Challenge on High FPS Motion Deblurring: Methods and Results
Abstract:
This paper presents a comprehensive review of the AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge, highlighting the proposed solutions and final results. The objective of this challenge is to identify effective networks capable of producing clearer and visually compelling images in diverse and challenging conditions, by learning representative visual cues for complex aggregations of motion types. A total of 68 participants registered for the competition, and 9 teams ultimately submitted valid entries. This paper thoroughly evaluates the state-of-the-art advances in high-FPS single image motion deblurring, showcasing the significant progress in the field, while leveraging samples of the novel dataset, MIORe, that introduces challenging examples of movement patterns.
中文摘要:本文综述了AIM 2025高帧率非均匀运动去模糊挑战赛,通过新型MIORe数据集评估了能提升图像清晰度的先进解决方案,共有68支队伍注册,最终9支队伍提交有效成果。
English Summary: This paper reviews the AIM 2025 High FPS Non-Uniform Motion Deblurring Challenge, evaluating state-of-the-art solutions that enhance image clarity using the novel MIORe dataset, with 68 registrants and 9 final submissions.

Authors:Mohsine El Khayati, Ayyad Maafiri, Yassine Himeur, Hamzah Ali Alkhazaleh, Shadi Atalla, Wathiq Mansoor
Title: Leveraging Transfer Learning and Mobile-enabled Convolutional Neural Networks for Improved Arabic Handwritten Character Recognition
Abstract:
The study explores the integration of transfer learning (TL) with mobile-enabled convolutional neural networks (MbNets) to enhance Arabic Handwritten Character Recognition (AHCR). Addressing challenges like extensive computational requirements and dataset scarcity, this research evaluates three TL strategies--full fine-tuning, partial fine-tuning, and training from scratch--using four lightweight MbNets: MobileNet, SqueezeNet, MnasNet, and ShuffleNet. Experiments were conducted on three benchmark datasets: AHCD, HIJJA, and IFHCDB. MobileNet emerged as the top-performing model, consistently achieving superior accuracy, robustness, and efficiency, with ShuffleNet excelling in generalization, particularly under full fine-tuning. The IFHCDB dataset yielded the highest results, with 99% accuracy using MnasNet under full fine-tuning, highlighting its suitability for robust character recognition. The AHCD dataset achieved competitive accuracy (97%) with ShuffleNet, while HIJJA posed significant challenges due to its variability, achieving a peak accuracy of 92% with ShuffleNet. Notably, full fine-tuning demonstrated the best overall performance, balancing accuracy and convergence speed, while partial fine-tuning underperformed across metrics. These findings underscore the potential of combining TL and MbNets for resource-efficient AHCR, paving the way for further optimizations and broader applications. Future work will explore architectural modifications, in-depth dataset feature analysis, data augmentation, and advanced sensitivity analysis to enhance model robustness and generalizability.
中文: 本研究证明,将迁移学习与移动优化的卷积神经网络相结合可显著提升阿拉伯语手写字符识别效果,其中MobileNet通过完全微调实现最佳性能,同时保持了计算效率。
English: This research demonstrates that combining transfer learning with mobile-optimized convolutional neural networks significantly improves Arabic handwritten character recognition, with MobileNet achieving top performance through full fine-tuning while maintaining computational efficiency.

Authors:Mohammad Abbadi, Yassine Himeur, Shadi Atalla, Wathiq Mansoor
Title: Interpretable Deep Transfer Learning for Breast Ultrasound Cancer Detection: A Multi-Dataset Study
Abstract:
Breast cancer remains a leading cause of cancer-related mortality among women worldwide. Ultrasound imaging, widely used due to its safety and cost-effectiveness, plays a key role in early detection, especially in patients with dense breast tissue. This paper presents a comprehensive study on the application of machine learning and deep learning techniques for breast cancer classification using ultrasound images. Using datasets such as BUSI, BUS-BRA, and BrEaST-Lesions USG, we evaluate classical machine learning models (SVM, KNN) and deep convolutional neural networks (ResNet-18, EfficientNet-B0, GoogLeNet). Experimental results show that ResNet-18 achieves the highest accuracy (99.7%) and perfect sensitivity for malignant lesions. Classical ML models, though outperformed by CNNs, achieve competitive performance when enhanced with deep feature extraction. Grad-CAM visualizations further improve model transparency by highlighting diagnostically relevant image regions. These findings support the integration of AI-based diagnostic tools into clinical workflows and demonstrate the feasibility of deploying high-performing, interpretable systems for ultrasound-based breast cancer detection.
中文: 本研究证明机器学习和深度学习模型,特别是准确率达99.7%的ResNet-18,能有效通过超声图像进行乳腺癌分类,为可解释性AI工具融入临床诊断提供了有力支持。
English: This study demonstrates that machine and deep learning models, particularly ResNet-18 achieving 99.7% accuracy, effectively classify breast cancer in ultrasound images, supporting the integration of interpretable AI tools into clinical diagnostics.

Authors:Pranav Narayanan Venkit, Philippe Laban, Yilun Zhou, Kung-Hsiang Huang, Yixin Mao, Chien-Sheng Wu
Title: DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
Abstract:
Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40--80% across systems.
中文: 生成式搜索引擎和深度研究代理常产生过度自信、片面且引用不准确的回答,为此我们开发了DeepTRACE审计框架,通过自动化分析和验证的大模型评判,从八个维度对这些系统进行全面评估。
English: Generative search engines and deep research agents often produce overly confident, one-sided responses with poor citation accuracy and unsupported statements, prompting the development of DeepTRACE, an audit framework that evaluates these systems across eight dimensions using automated analysis and validated LLM-judges.

Authors:Mohammad Zbeeb, Hasan Abed Al Kader Hammoud, Bernard Ghanem
Title: Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic
Abstract:
Large language models often require costly optimization, such as reinforcement learning, to master complex reasoning tasks. This work demonstrates that reasoning ability, once learned, can be extracted and transferred between models as a compact task vector. We source two publicly available, identically initialized Qwen2.5 models, one fine-tuned with supervised fine-tuning (SFT) and the other with group relative policy optimization (GRPO) on the same dataset. From these, we extract a reasoning vector: $v_{\text{reason}} = θ_{\text{GRPO}} - θ_{\text{SFT}}$. We hypothesize that this vector captures the reasoning capability instilled by reinforcement learning while factoring out shared knowledge from the SFT process. When added to compatible instruction-tuned models through simple arithmetic, this vector consistently improves performance across diverse reasoning benchmarks: GSM8K (+4.9%), HumanEval (+4.3%), SciQ (+1.7%), and BigBenchHard (+12.3% for the 1.5B model). The performance improvements persist under adversarial conditions. Conversely, subtracting the vector causes significant performance degradation (-11.8% on GSM8K), demonstrating the vector's strong contribution to the model's reasoning abilities. This work shows how reasoning capabilities, typically developed through expensive training, can be extracted from existing open-source models and reused through simple tensor arithmetic, offering a practical way to enhance models by recycling prior computational investments.
中文: 该研究表明,通过强化学习获得的推理能力可被提取为紧凑任务向量,并通过简单算术运算迁移至其他模型,在多个基准测试中持续提升性能,同时提供了一种重复利用计算投资的经济有效方法。
English: This research demonstrates that reasoning capabilities acquired through reinforcement learning can be extracted as a compact task vector and transferred to other models via simple arithmetic, consistently enhancing performance across multiple benchmarks while offering a cost-effective method to reuse computational investments.

Authors:Xihao Yuan, Siqi Liu, Yan Chen, Hang Zhou, Chang Liu, Hanting Chen, Jie Hu
Title: SaD: A Scenario-Aware Discriminator for Speech Enhancement
Abstract:
Generative adversarial network-based models have shown remarkable performance in the field of speech enhancement. However, the current optimization strategies for these models predominantly focus on refining the architecture of the generator or enhancing the quality evaluation metrics of the discriminator. This approach often overlooks the rich contextual information inherent in diverse scenarios. In this paper, we propose a scenario-aware discriminator that captures scene-specific features and performs frequency-domain division, thereby enabling a more accurate quality assessment of the enhanced speech generated by the generator. We conducted comprehensive experiments on three representative models using two publicly available datasets. The results demonstrate that our method can effectively adapt to various generator architectures without altering their structure, thereby unlocking further performance gains in speech enhancement across different scenarios.
中文: 生成对抗网络在语音增强中通过引入场景感知判别器,捕捉场景特征并执行频域划分,无需改动生成器结构即可提升不同场景下的语音增强性能。
English: Generative adversarial networks for speech enhancement are improved by a scenario-aware discriminator that captures scene-specific features and performs frequency-domain analysis, enabling better performance across various scenarios without modifying generator architectures.

Authors:Junsong Pu, Yichen Li, Zhuangbin Chen, Jinyang Liu, Zhihan Jiang, Jianjun Chen, Rui Shi, Zibin Zheng, Tieying Zhang
Title: ErrorPrism: Reconstructing Error Propagation Paths in Cloud Service Systems
Abstract:
Reliability management in cloud service systems is challenging due to the cascading effect of failures. Error wrapping, a practice prevalent in modern microservice development, enriches errors with context at each layer of the function call stack, constructing an error chain that describes a failure from its technical origin to its business impact. However, this also presents a significant traceability problem when recovering the complete error propagation path from the final log message back to its source. Existing approaches are ineffective at addressing this problem. To fill this gap, we present ErrorPrism in this work for automated reconstruction of error propagation paths in production microservice systems. ErrorPrism first performs static analysis on service code repositories to build a function call graph and map log strings to relevant candidate functions. This significantly reduces the path search space for subsequent analysis. Then, ErrorPrism employs an LLM agent to perform an iterative backward search to accurately reconstruct the complete, multi-hop error path. Evaluated on 67 production microservices at ByteDance, ErrorPrism achieves 97.0% accuracy in reconstructing paths for 102 real-world errors, outperforming existing static analysis and LLM-based approaches. ErrorPrism provides an effective and practical tool for root cause analysis in industrial microservice systems.
中文: ErrorPrism通过静态分析构建函数调用图,并利用LLM代理进行迭代反向搜索,有效解决了微服务中错误传播路径的追溯难题,在真实错误重构中达到97.0%的准确率。
English: ErrorPrism addresses the challenge of tracing error propagation in microservices by combining static analysis to build a function call graph and using an LLM agent for iterative backward search, achieving 97.0% accuracy in reconstructing error paths.

Authors:Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok
Title: Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search
Abstract:
Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
Chinese Summary: 本文提出PACO框架,通过蒙特卡洛树搜索的自适应规划动态优化属性控制顺序,无需训练即可实现多属性可控摘要,在不同模型上均优于现有方法并展现卓越的控制性能。
English Summary: The paper introduces PACO, a training-free framework that uses adaptive planning with Monte Carlo Tree Search to dynamically optimize the order of attribute adjustments in controllable summarization, achieving robust multi-attribute control across diverse models and outperforming existing methods.

Authors:Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Yan Teng, Xingjun Ma, Yingchun Wang
Title: SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs
Abstract:
The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework SafeEvalAgent, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. SafeEvalAgent leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of SafeEvalAgent, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.
中文: 本文提出SafeEvalAgent动态多智能体框架,通过自主演化安全基准持续揭示大语言模型深层漏洞,实验证明静态评估存在严重局限——如GPT-5在欧盟AI法案下的安全率从72.50%骤降至36.36%,凸显动态评估对保障AI安全部署的紧迫性。
English: This paper introduces SafeEvalAgent, a dynamic multi-agent framework that continuously evolves safety benchmarks through self-learning to expose critical vulnerabilities in large language models, demonstrating how static evaluations fail to capture escalating risks as shown by GPT-5's safety rate dropping from 72.50% to 36.36% under progressive testing.

Authors:Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, Xiaojian Wu
Title: Mem-α: Learning Memory Construction via Reinforcement Learning
Abstract:
Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Current memory-augmented agents typically depend on pre-defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem-alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi-turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question-answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem-alpha achieves significant improvements over existing memory-augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem-alpha.
中文: 本文提出Mem-alpha强化学习框架,通过交互反馈训练大语言模型智能体有效管理复杂记忆系统,在性能显著提升的同时,对超出训练长度13倍以上的序列展现出卓越的泛化能力。
English: This paper introduces Mem-alpha, a reinforcement learning framework that trains large language model agents to effectively manage complex memory systems through interaction and feedback, achieving significant performance improvements and remarkable generalization to sequences over 13 times longer than training data.

Authors:Zhihan Jiang, Jinyang Liu, Yichen Li, Haiyu Huang, Xiao He, Tieying Zhang, Jianjun Chen, Yi Li, Rui Shi, Michael R. Lyu
Title: LogPilot: Intent-aware and Scalable Alert Diagnosis for Large-scale Online Service Systems
Abstract:
Effective alert diagnosis is essential for ensuring the reliability of large-scale online service systems. However, on-call engineers are often burdened with manually inspecting massive volumes of logs to identify root causes. While various automated tools have been proposed, they struggle in practice due to alert-agnostic log scoping and the inability to organize complex data effectively for reasoning. To overcome these limitations, we introduce LogPilot, an intent-aware and scalable framework powered by Large Language Models (LLMs) for automated log-based alert diagnosis. LogPilot introduces an intent-aware approach, interpreting the logic in alert definitions (e.g., PromQL) to precisely identify causally related logs and requests. To achieve scalability, it reconstructs each request's execution into a spatiotemporal log chain, clusters similar chains to identify recurring execution patterns, and provides representative samples to the LLMs for diagnosis. This clustering-based approach ensures the input is both rich in diagnostic detail and compact enough to fit within the LLM's context window. Evaluated on real-world alerts from Volcano Engine Cloud, LogPilot improves the usefulness of root cause summarization by 50.34% and exact localization accuracy by 54.79% over state-of-the-art methods. With a diagnosis time under one minute and a cost of only $0.074 per alert, LogPilot has been successfully deployed in production, offering an automated and practical solution for service alert diagnosis.
Chinese: LogPilot是一种基于大语言模型的意图感知可扩展框架,通过精准识别相关日志和聚类执行模式来自动化告警诊断,相比现有方法显著提升了根因分析的准确性和效率。
English: LogPilot is an intent-aware, scalable framework using Large Language Models to automate log-based alert diagnosis by precisely identifying relevant logs and clustering execution patterns, significantly improving root cause analysis efficiency and accuracy over existing methods.

Authors:Leonardo Defilippis, Yizhou Xu, Julius Girardin, Emanuele Troiani, Vittorio Erba, Lenka Zdeborová, Bruno Loureiro, Florent Krzakala
Title: Scaling Laws and Spectra of Shallow Neural Networks in the Feature Learning Regime
Abstract:
Neural scaling laws underlie many of the recent advances in deep learning, yet their theoretical understanding remains largely confined to linear models. In this work, we present a systematic analysis of scaling laws for quadratic and diagonal neural networks in the feature learning regime. Leveraging connections with matrix compressed sensing and LASSO, we derive a detailed phase diagram for the scaling exponents of the excess risk as a function of sample complexity and weight decay. This analysis uncovers crossovers between distinct scaling regimes and plateau behaviors, mirroring phenomena widely reported in the empirical neural scaling literature. Furthermore, we establish a precise link between these regimes and the spectral properties of the trained network weights, which we characterize in detail. As a consequence, we provide a theoretical validation of recent empirical observations connecting the emergence of power-law tails in the weight spectrum with network generalization performance, yielding an interpretation from first principles.
中文摘要:本研究对二次和对角神经网络在特征学习中的缩放规律进行了理论分析,揭示了与经验观察相符的相变和平台行为,并建立了权重谱特性与泛化性能之间的理论联系。
English Summary: This study theoretically analyzes neural scaling laws for quadratic and diagonal networks in feature learning, revealing phase transitions and plateau behaviors that align with empirical observations while establishing connections between weight spectrum properties and generalization performance.

Authors:Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen
Title: UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark
Abstract:
Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.
Chinese: UI2V-Bench是一个新颖的基准测试,旨在通过关注四个关键维度的语义理解和推理能力来评估图像到视频模型,弥补了现有基准在此方面的不足。
English: UI2V-Bench is a new benchmark designed to evaluate Image-to-Video models by focusing on semantic understanding and reasoning across four key dimensions, addressing the current lack of such assessments in existing benchmarks.

Authors:Ruibo Chen, Sheng Zhang, Yihan Wu, Tong Zheng, Peihua Mai, Heng Huang
Title: Model Correlation Detection via Random Selection Probing
Abstract:
The growing prevalence of large language models (LLMs) and vision-language models (VLMs) has heightened the need for reliable techniques to determine whether a model has been fine-tuned from or is even identical to another. Existing similarity-based methods often require access to model parameters or produce heuristic scores without principled thresholds, limiting their applicability. We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test. RSP optimizes textual or visual prefixes on a reference model for a random selection task and evaluates their transferability to a target model, producing rigorous p-values that quantify evidence of correlation. To mitigate false positives, RSP incorporates an unrelated baseline model to filter out generic, transferable features. We evaluate RSP across both LLMs and VLMs under diverse access conditions for reference models and test models. Experiments on fine-tuned and open-source models show that RSP consistently yields small p-values for related models while maintaining high p-values for unrelated ones. Extensive ablation studies further demonstrate the robustness of RSP. These results establish RSP as the first principled and general statistical framework for model correlation detection, enabling transparent and interpretable decisions in modern machine learning ecosystems.
中文摘要:本文提出随机选择探测(RSP)这一统计框架,通过测试前缀可迁移性并生成严谨的p值来检测模型间关联性,能在不同访问条件下有效区分相关模型与无关模型。
English Summary: The paper introduces Random Selection Probing (RSP), a statistical framework that detects correlations between models by testing prefix transferability and generating rigorous p-values, effectively distinguishing related from unrelated models across various access conditions.

Authors:Arpit Garg, Hemanth Saratchandran, Ravi Garg, Simon Lucey
Title: Stable Forgetting: Bounded Parameter-Efficient Unlearning in LLMs
Abstract:
Machine unlearning in large language models (LLMs) is essential for privacy and safety; however, existing approaches remain unstable and unreliable. A widely used strategy, the gradient difference method, applies gradient descent on retained data while performing gradient ascent on forget data, the data whose influence should be removed. However, when combined with cross-entropy loss, this procedure causes unbounded growth of weights and gradients, leading to training instability and degrading both forgetting and retention. We provide a theoretical framework that explains this failure, explicitly showing how ascent on the forget set destabilizes optimization in the feedforward MLP layers of LLMs. Guided by this insight, we propose Bounded Parameter-Efficient Unlearning, a parameter-efficient approach that stabilizes LoRA-based fine-tuning by applying bounded functions to MLP adapters. This simple modification controls the weight dynamics during ascent, enabling the gradient difference method to converge reliably. Across the TOFU, TDEC, and MUSE benchmarks, and across architectures and scales from 125M to 8B parameters, our method achieves substantial improvements in forgetting while preserving retention, establishing a novel theoretically grounded and practically scalable framework for unlearning in LLMs.
中文摘要:本研究揭示了大型语言模型机器遗忘中因梯度无界增长导致的不稳定性,并提出一种参数高效方法,通过有界函数稳定训练,在多种基准测试和模型规模下显著提升遗忘效果同时保持模型记忆能力。
English Summary: This study identifies the instability in machine unlearning for LLMs caused by unbounded gradient growth and introduces a parameter-efficient method that stabilizes training through bounded functions, significantly improving forgetting performance while maintaining model retention across various benchmarks and scales.

Authors:Yihan Wu, Xuehao Cui, Ruibo Chen, Heng Huang
Title: Analyzing and Evaluating Unbiased Language Model Watermark
Abstract:
Verifying the authenticity of AI-generated text has become increasingly important with the rapid advancement of large language models, and unbiased watermarking has emerged as a promising approach due to its ability to preserve output distribution without degrading quality. However, recent work reveals that unbiased watermarks can accumulate distributional bias over multiple generations and that existing robustness evaluations are inconsistent across studies. To address these issues, we introduce UWbench, the first open-source benchmark dedicated to the principled evaluation of unbiased watermarking methods. Our framework combines theoretical and empirical contributions: we propose a statistical metric to quantify multi-batch distribution drift, prove an impossibility result showing that no unbiased watermark can perfectly preserve the distribution under infinite queries, and develop a formal analysis of robustness against token-level modification attacks. Complementing this theory, we establish a three-axis evaluation protocol: unbiasedness, detectability, and robustness, and show that token modification attacks provide more stable robustness assessments than paraphrasing-based methods. Together, UWbench offers the community a standardized and reproducible platform for advancing the design and evaluation of unbiased watermarking algorithms.
中文: 该摘要介绍了UWbench,这是一个全面的开源基准,旨在通过理论分析和三轴评估协议,解决无偏水印方法在AI生成文本中分布偏差累积和鲁棒性评估不一致的问题。
English: The abstract introduces UWbench, a comprehensive open-source benchmark designed to evaluate unbiased watermarking methods for AI-generated text by addressing distributional bias accumulation and inconsistent robustness assessments through theoretical analysis and a three-axis evaluation protocol.

Authors:Yihan Wu, Ruibo Chen, Georgios Milis, Heng Huang
Title: An Ensemble Framework for Unbiased Language Model Watermarking
Abstract:
As large language models become increasingly capable and widely deployed, verifying the provenance of machine-generated content is critical to ensuring trust, safety, and accountability. Watermarking techniques have emerged as a promising solution by embedding imperceptible statistical signals into the generation process. Among them, unbiased watermarking is particularly attractive due to its theoretical guarantee of preserving the language model's output distribution, thereby avoiding degradation in fluency or detectability through distributional shifts. However, existing unbiased watermarking schemes often suffer from weak detection power and limited robustness, especially under short text lengths or distributional perturbations. In this work, we propose ENS, a novel ensemble framework that enhances the detectability and robustness of logits-based unbiased watermarks while strictly preserving their unbiasedness. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. We theoretically prove that the ensemble construction remains unbiased in expectation and demonstrate how it improves the signal-to-noise ratio for statistical detectors. Empirical evaluations on multiple LLM families show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks without compromising generation quality.
中文: ENS框架通过顺序组合多个水印实例,在保持无偏性的同时增强了大型语言模型无偏水印的可检测性和鲁棒性,有效提升了抗攻击的检测性能。
English: The ENS framework enhances the detectability and robustness of unbiased watermarking for large language models by sequentially combining multiple watermark instances, preserving unbiasedness while improving detection performance against attacks.

Authors:Yixu Wang, Yan Teng, Yingchun Wang, Xingjun Ma
Title: StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have transformed vision model adaptation, enabling the rapid deployment of customized models. However, the compactness of LoRA adaptations introduces new safety concerns, particularly their vulnerability to model extraction attacks. This paper introduces a new focus of model extraction attacks named LoRA extraction that extracts LoRA-adaptive models based on a public pre-trained model. We then propose a novel extraction method called StolenLoRA which trains a substitute model to extract the functionality of a LoRA-adapted model using synthetic data. StolenLoRA leverages a Large Language Model to craft effective prompts for data generation, and it incorporates a Disagreement-based Semi-supervised Learning (DSL) strategy to maximize information gain from limited queries. Our experiments demonstrate the effectiveness of StolenLoRA, achieving up to a 96.60% attack success rate with only 10k queries, even in cross-backbone scenarios where the attacker and victim models utilize different pre-trained backbones. These findings reveal the specific vulnerability of LoRA-adapted models to this type of extraction and underscore the urgent need for robust defense mechanisms tailored to PEFT methods. We also explore a preliminary defense strategy based on diversified LoRA deployments, highlighting its potential to mitigate such attacks.
中文摘要:本文揭示了LoRA适配模型面临的新型提取攻击风险,提出的StolenLoRA方法仅用少量查询即可高效窃取模型功能,同时探讨了通过多样化部署的初步防御方案。
English Summary: This paper introduces LoRA extraction attacks, demonstrating how the StolenLoRA method can successfully steal LoRA-adapted model functionality with high success rates using minimal queries, while proposing preliminary defense strategies.

Authors:Kaishuai Xu, Wenjun Hou, Yi Cheng, Wenjie Li
Title: RAR$^2$: Retrieval-Augmented Medical Reasoning via Thought-Driven Retrieval
Abstract:
Large Language Models (LLMs) have shown promising performance on diverse medical benchmarks, highlighting their potential in supporting real-world clinical tasks. Retrieval-Augmented Generation (RAG) has emerged as a key approach for mitigating knowledge gaps and hallucinations by incorporating external medical information. However, RAG still struggles with complex medical questions that require intensive reasoning, as surface-level input often fails to reflect the true knowledge needs of the task. Existing methods typically focus on refining queries without explicitly modeling the reasoning process, limiting their ability to retrieve and integrate clinically relevant knowledge. In this work, we propose RAR$^2$, a joint learning framework that improves both Reasoning-Augmented Retrieval and Retrieval-Augmented Reasoning. RAR$^2$ constructs a thought process to uncover implicit knowledge requirements and uses it to guide retrieval and answer generation. We build a training dataset of mixed preference pairs and apply Direct Preference Optimization (DPO) to train the model. Moreover, we design two test-time scaling strategies to explore the boundaries of our framework. Experiments demonstrate the effectiveness of RAR$^2$ across several biomedical question answering datasets, outperforming RAG baselines with or without fine-tuning.
大语言模型在临床应用中展现出潜力,检索增强生成有助于弥补知识差距,但在处理复杂推理密集型医学问题时仍存在不足,因此提出了RAR²框架,通过联合学习增强推理检索和检索推理,在生物医学问答任务中表现优于现有方法。
Large Language Models show potential in clinical applications, and Retrieval-Augmented Generation helps address knowledge gaps, yet struggles with complex reasoning-intensive medical questions, leading to the proposed RAR² framework that enhances both reasoning-augmented retrieval and retrieval-augmented reasoning, outperforming existing methods in biomedical QA tasks.

Authors:Mu Huang, Linning Xu, Mingyue Dai, Yidi Shao, Bo Dai
Title: Reversible GNS for Dissipative Fluids with Consistent Bidirectional Dynamics
Abstract:
Simulating physically plausible trajectories toward user-defined goals is a fundamental yet challenging task in fluid dynamics. While particle-based simulators can efficiently reproduce forward dynamics, inverse inference remains difficult, especially in dissipative systems where dynamics are irreversible and optimization-based solvers are slow, unstable, and often fail to converge. In this work, we introduce the Reversible Graph Network Simulator (R-GNS), a unified framework that enforces bidirectional consistency within a single graph architecture. Unlike prior neural simulators that approximate inverse dynamics by fitting backward data, R-GNS does not attempt to reverse the underlying physics. Instead, we propose a mathematically invertible design based on residual reversible message passing with shared parameters, coupling forward dynamics with inverse inference to deliver accurate predictions and efficient recovery of plausible initial states. Experiments on three dissipative benchmarks (Water-3D, WaterRamps, and WaterDrop) show that R-GNS achieves higher accuracy and consistency with only one quarter of the parameters, and performs inverse inference more than 100 times faster than optimization-based baselines. For forward simulation, R-GNS matches the speed of strong GNS baselines, while in goal-conditioned tasks it eliminates iterative optimization and achieves orders-of-magnitude speedups. On goal-conditioned tasks, R-GNS further demonstrates its ability to complex target shapes (e.g., characters "L" and "N") through vivid, physically consistent trajectories. To our knowledge, this is the first reversible framework that unifies forward and inverse simulation for dissipative fluid systems.
中文摘要:可逆图网络模拟器(R-GNS)提出了一种统一的双向流体模拟框架,在耗散系统的正向动力学和逆向推理中均实现了高精度与高效率,其性能与速度显著超越现有方法。
English Summary: The Reversible Graph Network Simulator (R-GNS) introduces a unified framework for bidirectional fluid simulation that achieves high accuracy and efficiency in both forward dynamics and inverse inference for dissipative systems, significantly outperforming existing methods in speed and performance.

Authors:Ziyun Cui, Sike Jia, Yang Lin, Yinan Duan, Diyang Qu, Runsen Chen, Chao Zhang, Chang Lei, Wen Wu
Title: Speaker Anonymisation for Speech-based Suicide Risk Detection
Abstract:
Adolescent suicide is a critical global health issue, and speech provides a cost-effective modality for automatic suicide risk detection. Given the vulnerable population, protecting speaker identity is particularly important, as speech itself can reveal personally identifiable information if the data is leaked or maliciously exploited. This work presents the first systematic study of speaker anonymisation for speech-based suicide risk detection. A broad range of anonymisation methods are investigated, including techniques based on traditional signal processing, neural voice conversion, and speech synthesis. A comprehensive evaluation framework is built to assess the trade-off between protecting speaker identity and preserving information essential for suicide risk detection. Results show that combining anonymisation methods that retain complementary information yields detection performance comparable to that of original speech, while achieving protection of speaker identity for vulnerable populations.
中文摘要:本研究首次系统评估了基于语音的自杀风险检测中的说话人匿名化技术,证明结合互补性匿名方法可在保护说话人身份的同时,保持与原语音相当的风险检测性能。
English Summary: This study pioneers a systematic evaluation of speaker anonymization techniques for speech-based suicide risk detection, demonstrating that combining complementary methods can protect speaker identity without compromising detection performance comparable to original speech.

Authors:Zihuan Qiu, Yi Xu, Fanman Meng, Runtong Zhang, Linfeng Xu, Qingbo Wu, Hongliang Li
Title: Closing the Oracle Gap: Increment Vector Transformation for Class Incremental Learning
Abstract:
Class Incremental Learning (CIL) aims to sequentially acquire knowledge of new classes without forgetting previously learned ones. Despite recent progress, current CIL methods still exhibit significant performance gaps compared to their oracle counterparts-models trained with full access to historical data. Inspired by recent insights on Linear Mode Connectivity (LMC), we revisit the geometric properties of oracle solutions in CIL and uncover a fundamental observation: these oracle solutions typically maintain low-loss linear connections to the optimum of previous tasks. Motivated by this finding, we propose Increment Vector Transformation (IVT), a novel plug-and-play framework designed to mitigate catastrophic forgetting during training. Rather than directly following CIL updates, IVT periodically teleports the model parameters to transformed solutions that preserve linear connectivity to previous task optimum. By maintaining low-loss along these connecting paths, IVT effectively ensures stable performance on previously learned tasks. The transformation is efficiently approximated using diagonal Fisher Information Matrices, making IVT suitable for both exemplar-free and exemplar-based scenarios, and compatible with various initialization strategies. Extensive experiments on CIFAR-100, FGVCAircraft, ImageNet-Subset, and ImageNet-Full demonstrate that IVT consistently enhances the performance of strong CIL baselines. Specifically, on CIFAR-100, IVT improves the last accuracy of the PASS baseline by +5.12% and reduces forgetting by 2.54%. For the CLIP-pre-trained SLCA baseline on FGVCAircraft, IVT yields gains of +14.93% in average accuracy and +21.95% in last accuracy. The code will be released.
中文: 本文提出增量向量变换(IVT)框架,通过保持与先前任务最优解的线性连接来缓解类增量学习中的灾难性遗忘,在多个数据集上显著提升了基线模型的性能表现。
English: The paper introduces Increment Vector Transformation (IVT), a plug-and-play framework that leverages linear connectivity to previous task optima to mitigate catastrophic forgetting in Class Incremental Learning, significantly enhancing baseline performance across multiple datasets.

Authors:Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Title: Null-Space Filtering for Data-Free Continual Model Merging: Preserving Transparency, Promoting Fidelity
Abstract:
Data-free continual model merging (DFCMM) aims to fuse independently fine-tuned models into a single backbone that evolves with incoming tasks without accessing task data. This paper formulate two fundamental desiderata for DFCMM: transparency, avoiding interference with earlier tasks, and fidelity, adapting faithfully to each new task. This poses a challenge that existing approaches fail to address: how to bridge data-level desiderata with parameter-space optimization to ensure transparency and fidelity in the absence of task data. To this end, we propose NUFILT (NUll-space FILTering), a data-free framework that directly links these desiderata to optimization. Our key observation is that task vectors approximately align with representation subspaces, providing structural surrogates for enforcing transparency and fidelity. Accordingly, we design a null-space projector that preserves prior responses by filtering out overlapping components of new task vectors, thereby ensuring transparency, and a lightweight LoRA adapter that injects complementary task-specific signals, enabling fidelity in adapting to new tasks. The adapter is trained with a projection-based surrogate loss to retain consistency with previous knowledge while introducing novel directions. This joint filtering-adaptation process allows the backbone to absorb new knowledge while retaining existing behaviors, and the updates are finally fused back in a layer-wise linear fashion without extra parameters or inference cost. Theoretically, we establish approximate subspace alignment guarantees that justify null-space filtering. Empirically, NUFILT achieves state-of-the-art performance with minimal forgetting on both vision and NLP benchmarks, improving average accuracy by 4-7% over OPCM and WUDI-Merging, while narrowing the gap to fine-tuning and reducing computation overhead.
中文: 本文提出了NUFILT框架,通过过滤重叠任务向量成分确保透明度,并利用轻量适配器实现保真度,在无需任务数据的情况下实现了最先进的持续模型融合性能且遗忘最小。
English: This paper introduces NUFILT, a data-free framework for continual model merging that ensures transparency by filtering overlapping task vector components and fidelity through lightweight adapters, achieving state-of-the-art performance with minimal forgetting.

Authors:Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong
Title: Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Abstract:
Spoken dialogue models have significantly advanced intelligent human\textendash computer interaction, yet they lack a plug\textendash and\textendash play full\textendash duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix\textendashVAD, an LLM\textendash based model that enables streaming semantic endpoint detection. Specifically, Phoenix\textendash VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix\textendash VAD achieves excellent and competitive performance. Furthermore, this design enables the full\textendash duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next\textendash generation human\textendash computer interaction.
中文: Phoenix-VAD是一种基于大语言模型的新型流式语义端点检测方法,通过语义理解和滑动窗口训练实现高性能全双工人机交互,在不同语音场景下均表现出色。
English: Phoenix-VAD is a novel LLM-based model that enables streaming semantic endpoint detection for seamless full-duplex human-computer interaction, achieving competitive performance across various speech scenarios.

Authors:Weijie Wu, Wenhao Guan, Kaidi Wang, Peijie Chen, Zhuanling Zha, Junbo Li, Jun Fang, Lin Li, Qingyang Hong
Title: Phoenix-VAD: Streaming Semantic Endpoint Detection for Full-Duplex Speech Interaction
Abstract:
Spoken dialogue models have significantly advanced intelligent human-computer interaction, yet they lack a plug-and-play full-duplex prediction module for semantic endpoint detection, hindering seamless audio interactions. In this paper, we introduce Phoenix-VAD, an LLM-based model that enables streaming semantic endpoint detection. Specifically, Phoenix-VAD leverages the semantic comprehension capability of the LLM and a sliding window training strategy to achieve reliable semantic endpoint detection while supporting streaming inference. Experiments on both semantically complete and incomplete speech scenarios indicate that Phoenix-VAD achieves excellent and competitive performance. Furthermore, this design enables the full-duplex prediction module to be optimized independently of the dialogue model, providing more reliable and flexible support for next-generation human-computer interaction.
中文: Phoenix-VAD是一种基于大语言模型的新型流式语义端点检测方法,通过语义理解和滑动窗口训练实现高性能全双工人机交互,在不同语音场景下均表现出色。
English: Phoenix-VAD is a novel LLM-based model that enables streaming semantic endpoint detection for seamless full-duplex human-computer interaction, achieving competitive performance across various speech scenarios.

Authors:Alireza Salemi, Cheng Li, Mingyang Zhang, Qiaozhu Mei, Zhuowan Li, Spurthi Amba Hombaiah, Weize Kong, Tao Chen, Hamed Zamani, Michael Bendersky
Title: Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering
Abstract:
Personalization is essential for adapting question answering (QA) systems to user-specific information needs, thereby improving both accuracy and user satisfaction. However, personalized QA remains relatively underexplored due to challenges such as inferring preferences from long, noisy, and implicit contexts, and generating responses that are simultaneously correct, contextually appropriate, and aligned with user expectations and background knowledge. To address these challenges, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. The approach models the reasoning of an LLM as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark for personalized QA show that PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement. Human evaluation corroborates these results, with annotators preferring outputs from PoT in 66% of cases and reporting ties in only 15% of cases.
Chinese: 提出的思维路径(PoT)方法通过动态探索多样化推理轨迹并依据推断的用户偏好进行整合,使大语言模型能够生成个性化问答响应,在基准测试和人工评估中均实现了显著性能提升。
English: The proposed Pathways of Thoughts (PoT) method enables large language models to generate personalized question answering responses by dynamically exploring diverse reasoning trajectories and aggregating them based on inferred user preferences, achieving significant performance improvements on benchmarks and human evaluations.

Authors:Shengye Song, Minxian Xu, Kan Hu, Wenxia Guo, Kejiang Ye
Title: TD3-Sched: Learning to Orchestrate Container-based Cloud-Edge Resources via Distributed Reinforcement Learning
Abstract:
Resource scheduling in cloud-edge systems is challenging as edge nodes run latency-sensitive workloads under tight resource constraints, while existing centralized schedulers can suffer from performance bottlenecks and user experience degradation. To address the issues of distributed decisions in cloud-edge environments, we present TD3-Sched, a distributed reinforcement learning (DRL) scheduler based on Twin Delayed Deep Deterministic Policy Gradient (TD3) for continuous control of CPU and memory allocation, which can achieve optimized decisions for resource provisioning under dynamic workloads. On a realistic cloud-edge testbed with SockShop application and Alibaba traces, TD3-Sched achieves reductions of 17.9% to 38.6% in latency under same loads compared with other reinforcement-learning and rule-based baselines, and 16% to 31.6% under high loads. TD3-Sched also shows superior Service Level Objective (SLO) compliance with only 0.47% violations. These results indicate faster convergence, lower latency, and more stable performance while preserving service quality in container-based cloud-edge environment compared with the baselines.
中文: TD3-Sched作为一种分布式强化学习调度器,在云边系统中优化CPU和内存资源分配,相比现有方法,在动态工作负载下显著降低了延迟并提升了服务质量。
English: TD3-Sched is a distributed reinforcement learning scheduler that optimizes CPU and memory allocation in cloud-edge systems, significantly reducing latency and improving service quality under dynamic workloads compared to existing methods.

Authors:Gabriele Formis, Gianluca Cena, Lukasz Wisniewski, Stefano Scanzio
Title: Accurate and Efficient Prediction of Wi-Fi Link Quality Based on Machine Learning
Abstract:
Wireless communications are characterized by their unpredictability, posing challenges for maintaining consistent communication quality. This paper presents a comprehensive analysis of various prediction models, with a focus on achieving accurate and efficient Wi-Fi link quality forecasts using machine learning techniques. Specifically, the paper evaluates the performance of data-driven models based on the linear combination of exponential moving averages, which are designed for low-complexity implementations and are then suitable for hardware platforms with limited processing resources. Accuracy of the proposed approaches was assessed using experimental data from a real-world Wi-Fi testbed, considering both channel-dependent and channel-independent training data. Remarkably, channel-independent models, which allow for generalized training by equipment manufacturers, demonstrated competitive performance. Overall, this study provides insights into the practical deployment of machine learning-based prediction models for enhancing Wi-Fi dependability in industrial environments.
中文摘要:本文分析用于预测Wi-Fi链路质量的机器学习模型,发现低复杂度的信道无关模型表现优异,支持在工业环境中的实际部署。
English Summary: This paper analyzes machine learning models for predicting Wi-Fi link quality, finding that low-complexity channel-independent models perform competitively and support practical deployment in industrial settings.

Authors:Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler
Title: Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs
Abstract:
Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel's TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency overheads, further reduced by AMX. We run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4-8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).
中文: 大语言模型在隐私敏感领域面临安全挑战,但可信执行环境通过为CPU和GPU提供端到端的机密推理保护,以可接受的性能损耗实现了实际可行的解决方案。
English: Large Language Models face security challenges in privacy-sensitive sectors, but Trusted Execution Environments (TEEs) offer a practical solution by enabling confidential LLM inference with minimal performance overhead on both CPUs and GPUs.

Authors:Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jianbin Zheng, Yuxi Ren, Xuefeng Xiao
Title: Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
Abstract:
Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
中文: Hyper-Bagel 是一个统一加速框架,通过推测解码和多阶段蒸馏技术显著提升多模态理解与生成任务效率,在保持输出质量的同时实现理解速度提升2倍以上、生成任务加速高达22倍。
English: Hyper-Bagel is a unified acceleration framework that significantly speeds up multimodal understanding and generation tasks using speculative decoding and multi-stage distillation, achieving over 2x faster understanding and up to 22x faster generation while maintaining output quality.

Authors:Wangjie Li, Xingjia Xie, Yishuang Li, Wenhao Guan, Kaidi Wang, Pengyu Ren, Lin Li, Qingyang Hong
Title: XMUspeech Systems for the ASVspoof 5 Challenge
Abstract:
In this paper, we present our submitted XMUspeech systems to the speech deepfake detection track of the ASVspoof 5 Challenge. Compared to previous challenges, the audio duration in ASVspoof 5 database has significantly increased. And we observed that merely adjusting the input audio length can substantially improve system performance. To capture artifacts at multiple levels, we explored the performance of AASIST, HM-Conformer, Hubert, and Wav2vec2 with various input features and loss functions. Specifically, in order to obtain artifact-related information, we trained self-supervised models on the dataset containing spoofing utterances as the feature extractors. And we applied an adaptive multi-scale feature fusion (AMFF) method to integrate features from multiple Transformer layers with the hand-crafted feature to enhance the detection capability. In addition, we conducted extensive experiments on one-class loss functions and provided optimized configurations to better align with the anti-spoofing task. Our fusion system achieved a minDCF of 0.4783 and an EER of 20.45% in the closed condition, and a minDCF of 0.2245 and an EER of 9.36% in the open condition.
Chinese: 本文介绍了针对ASVspoof 5挑战赛的XMUspeech系统,通过优化输入音频长度、采用多种模型与自适应特征融合方法,显著提升了深度伪造语音检测性能,并在封闭和开放条件下均取得了优异结果。
English: This paper introduces the XMUspeech systems for the ASVspoof 5 Challenge, which improved deepfake speech detection by optimizing input audio length, employing multiple models with adaptive feature fusion, and achieving notable performance in both closed and open conditions.

Authors:Sunhao Dai, Jiakai Tang, Jiahua Wu, Kun Wang, Yuxuan Zhu, Bingjun Chen, Bangyang Hong, Yu Zhao, Cong Fu, Kangle Wu, Yabo Ni, Anxiang Zeng, Wenjie Wang, Xu Chen, Jun Xu, See-Kiong Ng
Title: OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
Abstract:
Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over $+2\%$ GMV/UU and a $+2.90\%$ increase in advertising revenue.
中文摘要:OnePiece框架将大型语言模型的上下文工程与多步推理机制融入工业级排序系统,在Shopee个性化搜索场景中实现了关键业务指标的显著提升。
English Summary: The OnePiece framework introduces LLM-style context engineering and multi-step reasoning into industrial ranking systems, achieving significant business improvements in Shopee's personalized search.

Authors:Guowei Liu, Le Liang, Chongtao Guo, Hao Ye, Shi Jin
Title: RSU-Assisted Resource Allocation for Collaborative Perception
Abstract:
As a pivotal technology for autonomous driving, collaborative perception enables vehicular agents to exchange perceptual data through vehicle-to-everything (V2X) communications, thereby enhancing perception accuracy of all collaborators. However, existing collaborative perception frameworks often assume ample communication resources, which is usually impractical in real-world vehicular networks. To address this challenge, this paper investigates the problem of communication resource allocation for collaborative perception and proposes RACooper, a novel RSU-assisted resource allocation framework that maximizes perception accuracy under constrained communication resources. RACooper leverages a hierarchical reinforcement learning model to dynamically allocate communication resources while accounting for real-time sensing data and channel dynamics induced by vehicular mobility. By jointly optimizing spatial confidence metrics and channel state information, our approach ensures efficient feature transmission, enhancing the effectiveness of collaborative perception. Simulation results demonstrate that compared to conventional baseline algorithms, RACooper achieves significant improvements in perception accuracy, especially under bandwidth-constrained scenarios.
中文: 本文提出RACooper框架,通过路侧单元辅助的分层强化学习动态分配车联网通信资源,在带宽受限条件下有效提升协同感知的精度。
English: This paper introduces RACooper, an RSU-assisted resource allocation framework using hierarchical reinforcement learning to optimize communication resources for collaborative perception in autonomous driving, significantly improving accuracy under bandwidth constraints.

Authors:Kuang-Da Wang, Shuoyang Ding, Chao-Han Huck Yang, Ping-Chun Hsieh, Wen-Chih Peng, Vitaly Lavrukhin, Boris Ginsburg
Title: Extending Automatic Machine Translation Evaluation to Book-Length Documents
Abstract:
Despite Large Language Models (LLMs) demonstrating superior translation performance and long-context capabilities, evaluation methodologies remain constrained to sentence-level assessment due to dataset limitations, token number restrictions in metrics, and rigid sentence boundary requirements. We introduce SEGALE, an evaluation scheme that extends existing automatic metrics to long-document translation by treating documents as continuous text and applying sentence segmentation and alignment methods. Our approach enables previously unattainable document-level evaluation, handling translations of arbitrary length generated with document-level prompts while accounting for under-/over-translations and varied sentence boundaries. Experiments show our scheme significantly outperforms existing long-form document evaluation schemes, while being comparable to evaluations performed with groundtruth sentence alignments. Additionally, we apply our scheme to book-length texts and newly demonstrate that many open-weight LLMs fail to effectively translate documents at their reported maximum context lengths.
中文摘要:SEGALE提出了一种针对长文档翻译的评估方案,通过将文档视为连续文本并应用句子分割对齐方法,突破了传统句子级评估的限制,实验证明该方案显著优于现有长文档评估方法。
English Summary: SEGALE introduces a document-level evaluation scheme for long-document translation by extending existing metrics to handle continuous text with flexible segmentation, overcoming limitations of sentence-level assessments and demonstrating superior performance in experiments.

Authors:Yan Rong, Chenxing Li, Dong Yu, Li Liu
Title: AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning
Abstract:
Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities due to the lack of training data with explicit reasoning chains and the absence of mechanisms for active exploration and iterative refinement. To address these challenges, we propose AudioGenie-Reasoner (AGR), the first unified training-free multi-agent system that coordinates perception and reasoning over an evolving chain of textual evidence. Our key idea is a paradigm shift that transforms audio deep reasoning into complex text understanding task from a new perspective, thereby unlocking the full potential of large language models. Specifically, the design of AGR mimics the human coarse-to-fine cognitive process. It first transforms the input audio into a coarse text-based document. Then, we design a novel proactive iterative document refinement loop, featuring tool-augmented routes and specialized agents, to continuously search for missing information and augment the evidence chain in a coarse-to-fine manner until sufficient question-related information is gathered for making final predictions. Experimental results show that AGR achieves state-of-the-art (SOTA) performance over existing open-source audio deep reasoning models across various benchmarks. The code will be made publicly available.
Chinese: AudioGenie-Reasoner (AGR) 是一种创新的免训练多智能体系统,通过将音频转化为文本并迭代优化证据链,弥合了音频感知与推理之间的差距,在音频深度推理任务中实现了最先进的性能。
English: AudioGenie-Reasoner (AGR) is a novel training-free multi-agent system that bridges the gap between audio perception and reasoning by transforming audio into text and iteratively refining evidence chains, achieving state-of-the-art performance in audio deep reasoning tasks.

Authors:Yingzhen Hu, Yiheng Zhong, Ruobing Li, Yingxue Su, Jiabao An, Feilong Tang, Jionglong Su, Imran Razzak
Title: SAM-DCE: Addressing Token Uniformity and Semantic Over-Smoothing in Medical Segmentation
Abstract:
The Segment Anything Model (SAM) demonstrates impressive zero-shot segmentation ability on natural images but encounters difficulties in medical imaging due to domain shifts, anatomical variability, and its reliance on user-provided prompts. Recent prompt-free adaptations alleviate the need for expert intervention, yet still suffer from limited robustness and adaptability, often overlooking the issues of semantic over-smoothing and token uniformity. We propose SAM-DCE, which balances local discrimination and global semantics while mitigating token uniformity, enhancing inter-class separability, and enriching mask decoding with fine-grained, consistent representations. Extensive experiments on diverse medical benchmarks validate its effectiveness.
中文: Segment Anything模型(SAM)在医学影像中因领域差异和提示依赖而受限,提出的SAM-DCE通过平衡局部与全局特征、增强类别区分性,在多种医学基准测试中有效提升了分割性能。
English: The Segment Anything Model (SAM) faces challenges in medical imaging due to domain shifts and prompt dependency, which the proposed SAM-DCE overcomes by balancing local and global features to improve segmentation accuracy across diverse benchmarks.

Authors:Eason Chen, Chuangji Li, Shizhuo Li, Conrad Borchers, Zimo Xiao, Chloe Qianhui Zhao, Jionghao Lin, Kenneth R. Koedinger
Title: Comparing RAG and GraphRAG for Page-Level Retrieval Question Answering on Math Textbook
Abstract:
Technology-enhanced learning environments often help students retrieve relevant learning content for questions arising during self-paced study. Large language models (LLMs) have emerged as novel aids for information retrieval during learning. While LLMs are effective for general-purpose question-answering, they typically lack alignment with the domain knowledge of specific course materials such as textbooks and slides. We investigate Retrieval-Augmented Generation (RAG) and GraphRAG, a knowledge graph-enhanced RAG approach, for page-level question answering in an undergraduate mathematics textbook. While RAG has been effective for retrieving discrete, contextually relevant passages, GraphRAG may excel in modeling interconnected concepts and hierarchical knowledge structures. We curate a dataset of 477 question-answer pairs, each tied to a distinct textbook page. We then compare the standard embedding-based RAG methods to GraphRAG for evaluating both retrieval accuracy-whether the correct page is retrieved-and generated answer quality via F1 scores. Our findings show that embedding-based RAG achieves higher retrieval accuracy and better F1 scores compared to GraphRAG, which tends to retrieve excessive and sometimes irrelevant content due to its entity-based structure. We also explored re-ranking the retrieved pages with LLM and observed mixed results, including performance drop and hallucinations when dealing with larger context windows. Overall, this study highlights both the promises and challenges of page-level retrieval systems in educational contexts, emphasizing the need for more refined retrieval methods to build reliable AI tutoring solutions in providing reference page numbers.
中文: 本研究比较了检索增强生成(RAG)和图增强RAG(GraphRAG)的教材页面检索效果,发现标准RAG在准确性和答案质量上更优,同时揭示了教育AI系统在提供精确页面参考时面临的挑战。
English: This study compares Retrieval-Augmented Generation (RAG) and GraphRAG for textbook page retrieval, finding that standard RAG outperforms GraphRAG in accuracy and answer quality while highlighting the challenges of educational AI systems.

Authors:Eason Chen, Chuangji Li, Shizhuo Li, Zimo Xiao, Jionghao Lin, Kenneth R. Koedinger
Title: Comparing RAG and GraphRAG for Page-Level Retrieval Question Answering on Math Textbook
Abstract:
Technology-enhanced learning environments often help students retrieve relevant learning content for questions arising during self-paced study. Large language models (LLMs) have emerged as novel aids for information retrieval during learning. While LLMs are effective for general-purpose question-answering, they typically lack alignment with the domain knowledge of specific course materials such as textbooks and slides. We investigate Retrieval-Augmented Generation (RAG) and GraphRAG, a knowledge graph-enhanced RAG approach, for page-level question answering in an undergraduate mathematics textbook. While RAG has been effective for retrieving discrete, contextually relevant passages, GraphRAG may excel in modeling interconnected concepts and hierarchical knowledge structures. We curate a dataset of 477 question-answer pairs, each tied to a distinct textbook page. We then compare the standard embedding-based RAG methods to GraphRAG for evaluating both retrieval accuracy-whether the correct page is retrieved-and generated answer quality via F1 scores. Our findings show that embedding-based RAG achieves higher retrieval accuracy and better F1 scores compared to GraphRAG, which tends to retrieve excessive and sometimes irrelevant content due to its entity-based structure. We also explored re-ranking the retrieved pages with LLM and observed mixed results, including performance drop and hallucinations when dealing with larger context windows. Overall, this study highlights both the promises and challenges of page-level retrieval systems in educational contexts, emphasizing the need for more refined retrieval methods to build reliable AI tutoring solutions in providing reference page numbers.
中文: 本研究比较了检索增强生成(RAG)和图增强RAG(GraphRAG)的教材页面检索效果,发现标准RAG在准确性和答案质量上更优,同时揭示了教育AI系统在提供精确页面参考时面临的挑战。
English: This study compares Retrieval-Augmented Generation (RAG) and GraphRAG for textbook page retrieval, finding that standard RAG outperforms GraphRAG in accuracy and answer quality while highlighting the challenges of educational AI systems.

Authors:Haoyuan Li, Rui Liu, Hehe Fan, Yi Yang
Title: Text-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding
Abstract:
Enabling agents to understand and interact with complex 3D scenes is a fundamental challenge for embodied artificial intelligence systems. While Multimodal Large Language Models (MLLMs) have achieved significant progress in 2D image understanding, extending such capabilities to 3D scenes remains difficult: 1) 3D environment involves richer concepts such as spatial relationships, affordances, physics, layout, and so on, 2) the absence of large-scale 3D vision-language datasets has posed a significant obstacle. In this paper, we introduce Text-Scene, a framework that automatically parses 3D scenes into textual descriptions for scene understanding. Given a 3D scene, our model identifies object attributes and spatial relationships, and then generates a coherent summary of the whole scene, bridging the gap between 3D observation and language without requiring human-in-the-loop intervention. By leveraging both geometric analysis and MLLMs, Text-Scene produces descriptions that are accurate, detailed, and human-interpretable, capturing object-level details and global-level context. Experimental results on benchmarks demonstrate that our textual parses can faithfully represent 3D scenes and benefit downstream tasks. To evaluate the reasoning capability of MLLMs, we present InPlan3D, a comprehensive benchmark for 3D task planning, consisting of 3174 long-term planning tasks across 636 indoor scenes. We emphasize clarity and accessibility in our approach, aiming to make 3D scene content understandable through language. Code and datasets will be released.
中文摘要:Text-Scene框架通过几何分析和多模态模型自动将3D场景转化为文本描述,解决了3D场景理解的难题,无需人工干预即可服务于下游任务。
English Summary: The Text-Scene framework automatically converts 3D scenes into textual descriptions using geometric analysis and multimodal models, addressing the challenge of 3D scene understanding and benefiting downstream tasks without human intervention.

Authors:Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Qinhong Yang, Tao Gong, Qi Chu, Mang Ye, Nenghai Yu
Title: Towards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification
Abstract:
In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, including both daytime and nighttime, ranging from short-term to long-term. However, existing ReID tasks and datasets can not meet this requirement, as they are constrained by available time and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 403k images of individuals wearing multiple clothes captured by RGB and IR cameras. Our data collection spans 21 months, and 270 volunteers were photographed on average 29.1 times across different dates or scenes, 4-15 times more than current datasets, providing conditions for follow-up investigations in AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific features learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model leads to satisfactory results and exhibits excellent generalization to all scenarios.
Chinese: 本文提出了任意时间行人重识别(AT-ReID)这一新任务,旨在跨多种时间场景检索目标人物,并基于AT-USTC数据集和Uni-AT统一模型,实现了在所有条件下的有效均衡检索性能。
English: This paper introduces Anytime Person Re-identification (AT-ReID), a novel task for retrieving individuals across multiple time scenarios, supported by the AT-USTC dataset and a unified model called Uni-AT that achieves effective and balanced performance in all conditions.

Authors:Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
Title: Robust LLM Training Infrastructure at ByteDance
Abstract:
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
大规模语言模型的训练规模迅速扩大但故障频发,需要像ByteRobust这样的鲁棒系统通过先进的容错与恢复机制来保障持续高效的训练。
The training of large language models is expanding rapidly but faces frequent failures, requiring robust systems like ByteRobust to ensure continuous and efficient operation through advanced fault tolerance and recovery mechanisms.

Authors:Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
Title: Robust LLM Training Infrastructure at ByteDance
Abstract:
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
大规模语言模型的训练规模迅速扩大但故障频发,需要像ByteRobust这样的鲁棒系统通过先进的容错与恢复机制来保障持续高效的训练。
The training of large language models is expanding rapidly but faces frequent failures, requiring robust systems like ByteRobust to ensure continuous and efficient operation through advanced fault tolerance and recovery mechanisms.

Authors:Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang
Title: Distribution-Aligned Decoding for Efficient LLM Task Adaptation
Abstract:
Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.
中文摘要:导向向量解码(SVD)是一种轻量级方法,通过在解码过程中利用短暂预热微调的梯度来对齐语言模型输出与任务分布,无需额外参数即可提升多项任务准确率。
English Summary: Steering Vector Decoding (SVD) is a lightweight method that aligns language model outputs with task distributions during decoding, achieving accuracy improvements without extra parameters by leveraging gradients from a brief warm-up fine-tuning.

Authors:Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, Yuguang Fang
Title: Distribution-Aligned Decoding for Efficient LLM Task Adaptation
Abstract:
Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.
中文摘要:导向向量解码(SVD)是一种轻量级方法,通过在解码过程中利用短暂预热微调的梯度来对齐语言模型输出与任务分布,无需额外参数即可提升多项任务准确率。
English Summary: Steering Vector Decoding (SVD) is a lightweight method that aligns language model outputs with task distributions during decoding, achieving accuracy improvements without extra parameters by leveraging gradients from a brief warm-up fine-tuning.

Authors:Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ashwinee Panda, Ryan Lagasse, Vasu Sharma
Title: Evaluation Awareness Scales Predictably in Open-Weights Large Language Models
Abstract:
Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as \emph{evaluation awareness}. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single $70$B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across $15$ models scaling from $0.27$B to $70$B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
中文: 大型语言模型表现出评估意识,即它们能区分测试和部署环境,这种欺骗性行为随模型规模按幂律关系可预测地增强,为未来模型的行为预测提供了依据,并指导人工智能安全评估策略的设计。
English: Large language models exhibit evaluation awareness, where they distinguish between testing and deployment contexts, with this deceptive behavior scaling predictably with model size according to a power-law relationship, enabling forecasts for future models and informing AI safety evaluation strategies.

Authors:Pedro Garcia Lopez, Daniel Barcelona Pons, Marcin Copik, Torsten Hoefler, Eduardo Quiñones, Maciej Malawski, Peter Pietzutch, Alberto Marti, Thomas Ohlson Timoudas, Aleksander Slominski
Title: AI Factories: It's time to rethink the Cloud-HPC divide
Abstract:
The strategic importance of artificial intelligence is driving a global push toward Sovereign AI initiatives. Nationwide governments are increasingly developing dedicated infrastructures, called AI Factories (AIF), to achieve technological autonomy and secure the resources necessary to sustain robust local digital ecosystems. In Europe, the EuroHPC Joint Undertaking is investing hundreds of millions of euros into several AI Factories, built atop existing high-performance computing (HPC) supercomputers. However, while HPC systems excel in raw performance, they are not inherently designed for usability, accessibility, or serving as public-facing platforms for AI services such as inference or agentic applications. In contrast, AI practitioners are accustomed to cloud-native technologies like Kubernetes and object storage, tools that are often difficult to integrate within traditional HPC environments. This article advocates for a dual-stack approach within supercomputers: integrating both HPC and cloud-native technologies. Our goal is to bridge the divide between HPC and cloud computing by combining high performance and hardware acceleration with ease of use and service-oriented front-ends. This convergence allows each paradigm to amplify the other. To this end, we will study the cloud challenges of HPC (Serverless HPC) and the HPC challenges of cloud technologies (High-performance Cloud).
中文: 本文提出在超级计算机中采用高性能计算与云原生技术并存的双栈架构,旨在弥合欧洲AI工厂等主权AI计划中性能与易用性之间的鸿沟。
English: The article proposes a dual-stack approach integrating HPC and cloud-native technologies in supercomputers to bridge performance and usability gaps for Sovereign AI initiatives like Europe's AI Factories.

Authors:Hamied Nabizada, Lasse Beers, Alain Chahine, Felix Gehlhoff, Oliver Niggemann, Alexander Fay
Title: Bridging Engineering and AI Planning through Model-Based Knowledge Transformation for the Validation of Automated Production System Variants
Abstract:
Engineering models created in Model-Based Systems Engineering (MBSE) environments contain detailed information about system structure and behavior. However, they typically lack symbolic planning semantics such as preconditions, effects, and constraints related to resource availability and timing. This limits their ability to evaluate whether a given system variant can fulfill specific tasks and how efficiently it performs compared to alternatives. To address this gap, this paper presents a model-driven method that enables the specification and automated generation of symbolic planning artifacts within SysML-based engineering models. A dedicated SysML profile introduces reusable stereotypes for core planning constructs. These are integrated into existing model structures and processed by an algorithm that generates a valid domain file and a corresponding problem file in Planning Domain Definition Language (PDDL). In contrast to previous approaches that rely on manual transformations or external capability models, the method supports native integration and maintains consistency between engineering and planning artifacts. The applicability of the method is demonstrated through a case study from aircraft assembly. The example illustrates how existing engineering models are enriched with planning semantics and how the proposed workflow is applied to generate consistent planning artifacts from these models. The generated planning artifacts enable the validation of system variants through AI planning.
中文摘要:本文提出一种模型驱动方法,通过专用SysML配置文件将规划语义集成到系统工程模型中,并利用算法自动生成PDDL规划文件,从而支持基于AI规划的系统变体验证。
English Summary: This paper introduces a model-driven method to automatically generate symbolic planning artifacts from SysML-based engineering models, enabling AI planning validation of system variants through a dedicated SysML profile and algorithm that produces PDDL files.

Authors:Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi
Title: Learning to Generate 4D LiDAR Sequences
Abstract:
While generative world models have advanced video and occupancy-based data synthesis, LiDAR generation remains underexplored despite its importance for accurate 3D perception. Extending generation to 4D LiDAR data introduces challenges in controllability, temporal stability, and evaluation. We present LiDARCrafter, a unified framework that converts free-form language into editable LiDAR sequences. Instructions are parsed into ego-centric scene graphs, which a tri-branch diffusion model transforms into object layouts, trajectories, and shapes. A range-image diffusion model generates the initial scan, and an autoregressive module extends it into a temporally coherent sequence. The explicit layout design further supports object-level editing, such as insertion or relocation. To enable fair assessment, we provide EvalSuite, a benchmark spanning scene-, object-, and sequence-level metrics. On nuScenes, LiDARCrafter achieves state-of-the-art fidelity, controllability, and temporal consistency, offering a foundation for LiDAR-based simulation and data augmentation.
中文:LiDARCrafter是一个创新框架,通过场景图和扩散模型将语言指令转化为可编辑的4D激光雷达序列,在保真度和可控性方面达到领先水平,同时支持对象级编辑和全面评估。
English: LiDARCrafter is a novel framework that converts language instructions into editable 4D LiDAR sequences through scene graphs and diffusion models, achieving state-of-the-art performance in fidelity and controllability while enabling object-level editing and comprehensive evaluation.

Authors:Wenchao Gu, Yupan Chen, Yanlin Wang, Hongyu Zhang, Cuiyun Gao, Michael R. Lyu
Title: Weakly Supervised Vulnerability Localization via Multiple Instance Learning
Abstract:
Software vulnerability detection has emerged as a significant concern in the field of software security recently, capturing the attention of numerous researchers and developers. Most previous approaches focus on coarse-grained vulnerability detection, such as at the function or file level. However, the developers would still encounter the challenge of manually inspecting a large volume of code inside the vulnerable function to identify the specific vulnerable statements for modification, indicating the importance of vulnerability localization. Training the model for vulnerability localization usually requires ground-truth labels at the statement-level, and labeling vulnerable statements demands expert knowledge, which incurs high costs. Hence, the demand for an approach that eliminates the need for additional labeling at the statement-level is on the rise. To tackle this problem, we propose a novel approach called WAVES for WeAkly supervised Vulnerability Localization via multiplE inStance learning, which does not need the additional statement-level labels during the training. WAVES has the capability to determine whether a function is vulnerable (i.e., vulnerability detection) and pinpoint the vulnerable statements (i.e., vulnerability localization). Specifically, inspired by the concept of multiple instance learning, WAVES converts the ground-truth label at the function-level into pseudo labels for individual statements, eliminating the need for additional statement-level labeling. These pseudo labels are utilized to train the classifiers for the function-level representation vectors. Extensive experimentation on three popular benchmark datasets demonstrates that, in comparison to previous baselines, our approach achieves comparable performance in vulnerability detection and state-of-the-art performance in statement-level vulnerability localization.
Chinese: WAVES方法通过多示例学习提出了一种新颖的软件漏洞定位技术,无需额外语句级标注即可实现漏洞检测,并在定位漏洞语句方面达到了最先进的性能水平。
English: The WAVES approach introduces a novel method for software vulnerability localization using multiple instance learning, eliminating the need for costly statement-level labels while achieving state-of-the-art performance in pinpointing vulnerable code statements.

Authors:Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, zhiliang wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
Title: Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
Abstract:
Sparse Mixture-of-Experts (SMoE) architectures are widely used in large language models (LLMs) due to their computational efficiency. However, though only a few experts are activated for each token, SMoE still requires loading all expert parameters, leading to high memory usage and challenges in deployment. Previous work has tried to reduce the overhead by pruning and merging experts, but primarily focused on expert-level operations, leaving neuron-level structure underexplored. We propose DERN (Dropping Experts, Recombining Neurons), a task-agnostic and retraining-free framework for expert pruning and reconstruction. We observe that experts are often misaligned and contain semantic conflicts at the neuron level, which poses challenges for direct merging. To solve this, DERN works in three steps: it first prunes redundant experts using router statistics; then it decomposes them into neuron-level expert segments, assigning each segment to its most compatible retained expert; and finally, it merges segments within each retained expert to build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE models show that DERN improves performance by more than 5% on commonsense reasoning and MMLU benchmarks under 50% expert sparsity, without extra training. It also greatly reduces the number of experts and memory usage, making SMoE LLMs easier to deploy in practice.
中文: DERN框架通过剪枝冗余专家并重组其神经元来构建更紧凑的稀疏专家混合模型,无需重新训练即可显著提升性能并大幅降低内存使用。
English: DERN is a novel framework that prunes redundant experts and recombines their neurons to create a more compact SMoE model, significantly improving performance and reducing memory usage without requiring retraining.

Authors:Franklin Yiu, Mohan Lu, Nina Li, Kevin Joseph, Tianxu Zhang, Julian Togelius, Timothy Merino, Sam Earle
Title: A Markovian Framing of WaveFunctionCollapse for Procedurally Generating Aesthetically Complex Environments
Abstract:
Procedural content generation often requires satisfying both designer-specified objectives and adjacency constraints implicitly imposed by the underlying tile set. To address the challenges of jointly optimizing both constraints and objectives, we reformulate WaveFunctionCollapse (WFC) as a Markov Decision Process (MDP), enabling external optimization algorithms to focus exclusively on objective maximization while leveraging WFC's propagation mechanism to enforce constraint satisfaction. We empirically compare optimizing this MDP to traditional evolutionary approaches that jointly optimize global metrics and local tile placement. Across multiple domains with various difficulties, we find that joint optimization not only struggles as task complexity increases, but consistently underperforms relative to optimization over the WFC-MDP, underscoring the advantages of decoupling local constraint satisfaction from global objective optimization.
中文: 本研究将波函数坍缩重构为马尔可夫决策过程,成功将全局目标优化与局部约束满足相分离,在多项复杂任务中展现出优于传统联合优化方法的性能表现。
English: This study reformulates WaveFunctionCollapse as a Markov Decision Process to separate global objective optimization from local constraint satisfaction, demonstrating superior performance over traditional joint optimization methods across various complex tasks.

Authors:Payal Varshney, Adriano Lucieri, Christoph Balada, Sheraz Ahmed, Andreas Dengel
Title: LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations
Abstract:
Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets, including EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R2 score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering valuable insights into the target model behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.
Chinese: 本文提出LD-ViCE新型框架,通过潜在扩散技术为视频AI模型高效生成逼真可解释的反事实说明,在多个数据集中显著提升计算效率与解释质量。
English: The paper introduces LD-ViCE, a novel framework that efficiently generates realistic and interpretable counterfactual explanations for video-based AI models by leveraging latent diffusion, significantly improving computational efficiency and explanation quality across diverse datasets.

Authors:Ze-Xin Yin, Jiaxiong Qiu, Liu Liu, Xinjie Wang, Wei Sui, Zhizhong Su, Jian Yang, Jin Xie
Title: DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation
Abstract:
The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text-and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme leads to efficient convergence trained on merely 69k multi-view instances. Our code, pre-trained weights, and the dataset used will be publicly available via our project page: https://zx-yin.github.io/dreamlifting/.
中文: 本文提出了轻量化高斯资产适配器(LGAA),一种利用多视角扩散先验统一几何与PBR材质建模的模块化框架,通过仅需6.9万样本的高效训练即可实现端到端3D资产生成,并提取出高质量可重照明的网格资源。
English: The paper introduces the Lightweight Gaussian Asset Adapter (LGAA), a modular framework that unifies geometry and PBR material modeling using multi-view diffusion priors for end-to-end 3D asset generation, achieving efficient convergence and high-quality results with minimal data.

Authors:Xin Kong, Daniel Watson, Yannick Strümpler, Michael Niemeyer, Federico Tombari
Title: CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis
Abstract:
Multi-view diffusion models have shown promise in 3D novel view synthesis, but most existing methods adopt a non-autoregressive formulation. This limits their applicability in world modeling, as they only support a fixed number of views and suffer from slow inference due to denoising all frames simultaneously. To address these limitations, we propose CausNVS, a multi-view diffusion model in an autoregressive setting, which supports arbitrary input-output view configurations and generates views sequentially. We train CausNVS with causal masking and per-frame noise, using pairwise-relative camera pose encodings (CaPE) for precise camera control. At inference time, we combine a spatially-aware sliding-window with key-value caching and noise conditioning augmentation to mitigate drift. Our experiments demonstrate that CausNVS supports a broad range of camera trajectories, enables flexible autoregressive novel view synthesis, and achieves consistently strong visual quality across diverse settings. Project page: https://kxhit.github.io/CausNVS.html.
中文:CausNVS提出了一种自回归多视角扩散模型,通过精确的相机控制和增强技术,实现了灵活的新视角合成,并在不同设置下保持出色的视觉质量。
English: CausNVS introduces an autoregressive multi-view diffusion model that enables flexible novel view synthesis with precise camera control and strong visual quality across various configurations.

Authors:Afif Boudaoud, Alexandru Calotoiu, Marcin Copik, Torsten Hoefler
Title: DaCe AD: Unifying High-Performance Automatic Differentiation for Machine Learning and Scientific Computing
Abstract:
Automatic differentiation (AD) is a set of techniques that systematically applies the chain rule to compute the gradients of functions without requiring human intervention. Although the fundamentals of this technology were established decades ago, it is experiencing a renaissance as it plays a key role in efficiently computing gradients for backpropagation in machine learning algorithms. AD is also crucial for many applications in scientific computing domains, particularly emerging techniques that integrate machine learning models within scientific simulations and schemes. Existing AD frameworks have four main limitations: limited support of programming languages, requiring code modifications for AD compatibility, limited performance on scientific computing codes, and a naive store-all solution for forward-pass data required for gradient calculations. These limitations force domain scientists to manually compute the gradients for large problems. This work presents DaCe AD, a general, efficient automatic differentiation engine that requires no code modifications. DaCe AD uses a novel ILP-based algorithm to optimize the trade-off between storing and recomputing to achieve maximum performance within a given memory constraint. We showcase the generality of our method by applying it to NPBench, a suite of HPC benchmarks with diverse scientific computing patterns, where we outperform JAX, a Python framework with state-of-the-art general AD capabilities, by more than 92 times on average without requiring any code changes.
中文: DaCe AD 是一种无需代码修改的新型自动微分引擎,采用基于整数线性规划的算法优化内存与性能,在科学计算基准测试中平均比现有框架(如 JAX)快92倍以上。
English: DaCe AD is a novel automatic differentiation engine that eliminates the need for code modifications and uses an ILP-based algorithm to optimize memory and performance, significantly outperforming existing frameworks like JAX by over 92 times on average in scientific computing benchmarks.

Authors:Zhenyu Wu, Angyuan Ma, Xiuwei Xu, Hang Yin, Yinan Liang, Ziwei Wang, Jiwen Lu, Haibin Yan
Title: MoTo: A Zero-shot Plug-in Interaction-aware Navigation for General Mobile Manipulation
Abstract:
Mobile manipulation stands as a core challenge in robotics, enabling robots to assist humans across varied tasks and dynamic daily environments. Conventional mobile manipulation approaches often struggle to generalize across different tasks and environments due to the lack of large-scale training. However, recent advances in manipulation foundation models demonstrate impressive generalization capability on a wide range of fixed-base manipulation tasks, which are still limited to a fixed setting. Therefore, we devise a plug-in module named MoTo, which can be combined with any off-the-shelf manipulation foundation model to empower them with mobile manipulation ability. Specifically, we propose an interaction-aware navigation policy to generate robot docking points for generalized mobile manipulation. To enable zero-shot ability, we propose an interaction keypoints framework via vision-language models (VLM) under multi-view consistency for both target object and robotic arm following instructions, where fixed-base manipulation foundation models can be employed. We further propose motion planning objectives for the mobile base and robot arm, which minimize the distance between the two keypoints and maintain the physical feasibility of trajectories. In this way, MoTo guides the robot to move to the docking points where fixed-base manipulation can be successfully performed, and leverages VLM generation and trajectory optimization to achieve mobile manipulation in a zero-shot manner, without any requirement on mobile manipulation expert data. Extensive experimental results on OVMM and real-world demonstrate that MoTo achieves success rates of 2.68% and 16.67% higher than the state-of-the-art mobile manipulation methods, respectively, without requiring additional training data.
中文:MoTo模块通过交互感知导航策略和基于视觉语言模型的关键点生成,赋予现有操作基础模型移动操作能力,无需额外训练数据即可实现更高的成功率。
English: The MoTo module enhances existing manipulation foundation models by enabling mobile manipulation through an interaction-aware navigation policy and vision-language model-based keypoint generation, achieving higher success rates without requiring additional training data.

Authors:Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong, Haihao Liu, Vasu Sharma, Kevin Zhu
Title: Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Abstract:
Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model's true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from "test-like" to "deploy-like" and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten "deploy-like" prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
中文: 大型语言模型在测试与部署环境中存在可量化的行为差异,改写后的"部署式"提示使诚实回答平均提升5.26%、欺骗性回答降低12.40%,这揭示了建立更贴近现实的评估框架的迫切性。
English: Large Language Models exhibit quantifiable behavioral shifts between test and deployment contexts, with rewritten "deploy-like" prompts increasing honesty by 5.26% and reducing deception by 12.40%, revealing the need for more realistic evaluation frameworks.

Authors:Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, Kevin Zhu
Title: Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Abstract:
Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model's true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from "test-like" to "deploy-like" and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten "deploy-like" prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
中文: 大型语言模型在测试与部署环境中存在可量化的行为差异,改写后的"部署式"提示使诚实回答平均提升5.26%、欺骗性回答降低12.40%,这揭示了建立更贴近现实的评估框架的迫切性。
English: Large Language Models exhibit quantifiable behavioral shifts between test and deployment contexts, with rewritten "deploy-like" prompts increasing honesty by 5.26% and reducing deception by 12.40%, revealing the need for more realistic evaluation frameworks.

Authors:Maria Parelli, Michael Oechsle, Michael Niemeyer, Federico Tombari, Andreas Geiger
Title: 3D-LATTE: Latent Space 3D Editing from Textual Instructions
Abstract:
Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lacks surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods enabling high-fidelity, precise, and robust edits across a wide range of shapes and semantic manipulations.
中文: 该研究提出了一种无需训练的3D编辑方法,通过融合3D注意力映射与几何感知优化策略,在原生3D扩散模型的潜在空间中直接操控几何结构,实现了跨形状与语义的高精度、强鲁棒性编辑。
English: The proposed training-free 3D editing method operates within a native 3D diffusion model's latent space, utilizing blended 3D attention maps and geometry-aware techniques to achieve superior, high-fidelity edits across diverse shapes and semantics.

Authors:Maria Parelli, Michael Oechsle, Michael Niemeyer, Federico Tombari, Andreas Geiger
Title: 3D-LATTE: Latent Space 3D Editing from Textual Instructions
Abstract:
Despite the recent success of multi-view diffusion models for text/image-based 3D asset generation, instruction-based editing of 3D assets lacks surprisingly far behind the quality of generation models. The main reason is that recent approaches using 2D priors suffer from view-inconsistent editing signals. Going beyond 2D prior distillation methods and multi-view editing strategies, we propose a training-free editing method that operates within the latent space of a native 3D diffusion model, allowing us to directly manipulate 3D geometry. We guide the edit synthesis by blending 3D attention maps from the generation with the source object. Coupled with geometry-aware regularization guidance, a spectral modulation strategy in the Fourier domain and a refinement step for 3D enhancement, our method outperforms previous 3D editing methods enabling high-fidelity, precise, and robust edits across a wide range of shapes and semantic manipulations.
中文: 该研究提出了一种无需训练的3D编辑方法,通过融合3D注意力映射与几何感知优化策略,在原生3D扩散模型的潜在空间中直接操控几何结构,实现了跨形状与语义的高精度、强鲁棒性编辑。
English: The proposed training-free 3D editing method operates within a native 3D diffusion model's latent space, utilizing blended 3D attention maps and geometry-aware techniques to achieve superior, high-fidelity edits across diverse shapes and semantics.

Authors:Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly
Title: ChipChat: Low-Latency Cascaded Conversational Agent in MLX
Abstract:
The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. Our system integrates streaming (a) conversational speech recognition with mixture-of-experts, (b) state-action augmented LLM, (c) text-to-speech synthesis, (d) neural vocoder, and (e) speaker modeling. Implemented using MLX, ChipChat achieves sub-second response latency on a Mac Studio without dedicated GPUs, while preserving user privacy through complete on-device processing. Our work shows that strategically redesigned CSs can overcome their historical latency limitations, offering a promising path forward for practical voice-based AI agents.
中文: ChipChat通过架构创新和流式优化提出了一种新型低延迟级联系统,在消费级硬件上实现亚秒级响应,同时通过完全在设备上处理来保护用户隐私。
English: ChipChat introduces a novel low-latency cascaded system that overcomes traditional bottlenecks through architectural innovations and streaming optimizations, achieving sub-second response times on consumer hardware while maintaining full on-device processing for privacy.

Authors:Aoming Liu, Kevin Miller, Venkatesh Saligrama, Kate Saenko, Boqing Gong, Ser-Nam Lim, Bryan A. Plummer
Title: Scaling Up Temporal Domain Generalization via Temporal Experts Averaging
Abstract:
Temporal Domain Generalization (TDG) aims to generalize across temporal distribution shifts, e.g., lexical change over time. Prior work often addresses this by predicting future model weights. However, full model prediction is prohibitively expensive for even reasonably sized models. Thus, recent methods only predict the classifier layer, limiting generalization by failing to adjust other model components. To address this, we propose Temporal Experts Averaging (TEA), a novel and scalable TDG framework that updates the entire model using weight averaging to maximize generalization potential while minimizing computational costs. Our theoretical analysis guides us to two steps that enhance generalization to future domains. First, we create expert models with functional diversity yet parameter similarity by fine-tuning a domain-agnostic base model on individual temporal domains while constraining weight changes. Second, we optimize the bias-variance tradeoff through adaptive averaging coefficients derived from modeling temporal weight trajectories in a principal component subspace. Expert's contributions are based on their projected proximity to future domains. Extensive experiments across 7 TDG benchmarks, 5 models, and 2 TDG settings shows TEA outperforms prior TDG methods by up to 69% while being up to 60x more efficient.
中文: 我们提出了时序专家平均(TEA)框架,通过将领域无关基础模型微调为权重变化受限的专家模型,并基于其在主成分子空间中与未来领域的投影邻近度自适应地平均权重,从而在多个基准测试中实现了卓越的时序领域泛化性能和效率。
English: We introduce Temporal Experts Averaging (TEA), a scalable framework that enhances temporal domain generalization by fine-tuning domain-agnostic base models into experts with constrained weight changes and adaptively averaging their weights based on projected proximity to future domains, achieving superior performance and efficiency across multiple benchmarks.

Authors:Christophe Botella, Benjamin Deneu, Diego Marcos, Maximilien Servajean, Theo Larcher, Cesar Leblanc, Joaquim Estopinan, Pierre Bonnet, Alexis Joly
Title: Overview of GeoLifeCLEF 2023: Species Composition Prediction with High Spatial Resolution at Continental Scale Using Remote Sensing
Abstract:
Understanding the spatio-temporal distribution of species is a cornerstone of ecology and conservation. By pairing species observations with geographic and environmental predictors, researchers can model the relationship between an environment and the species which may be found there. To advance the state-of-the-art in this area with deep learning models and remote sensing data, we organized an open machine learning challenge called GeoLifeCLEF 2023. The training dataset comprised 5 million plant species observations (single positive label per sample) distributed across Europe and covering most of its flora, high-resolution rasters: remote sensing imagery, land cover, elevation, in addition to coarse-resolution data: climate, soil and human footprint variables. In this multi-label classification task, we evaluated models ability to predict the species composition in 22 thousand small plots based on standardized surveys. This paper presents an overview of the competition, synthesizes the approaches used by the participating teams, and analyzes the main results. In particular, we highlight the biases faced by the methods fitted to single positive labels when it comes to the multi-label evaluation, and the new and effective learning strategy combining single and multi-label data in training.
中文: GeoLifeCLEF 2023竞赛通过结合深度学习与多源环境数据推进物种分布建模,揭示了单标签训练在多标签评估中的偏差,并提出了结合单标签和多标签数据的有效混合学习策略。
English: The GeoLifeCLEF 2023 challenge advanced species distribution modeling by using deep learning with multi-source environmental data to predict plant species composition, revealing biases in single-label training and introducing effective hybrid learning strategies.

Authors:Fang Liu, Tianze Wang, Li Zhang, Zheyu Yang, Jing Jiang, Zian Sun
Title: Explainable Fault Localization for Programming Assignments via LLM-Guided Annotation
Abstract:
Providing timely and personalized guidance for students' programming assignments, offers significant practical value for helping students complete assignments and enhance their learning. In recent years, various automated Fault Localization (FL) techniques have demonstrated promising results in identifying errors in programs. However, existing FL techniques face challenges when applied to educational contexts. Most approaches operate at the method level without explanatory feedback, resulting in granularity too coarse for students who need actionable insights to identify and fix their errors. While some approaches attempt line-level fault localization, they often depend on predicting line numbers directly in numerical form, which is ill-suited to LLMs. To address these challenges, we propose FLAME, a fine-grained, explainable Fault Localization method tailored for programming assignments via LLM-guided Annotation and Model Ensemble. FLAME leverages rich contextual information specific to programming assignments to guide LLMs in identifying faulty code lines. Instead of directly predicting line numbers, we prompt the LLM to annotate faulty code lines with detailed explanations, enhancing both localization accuracy and educational value. To further improve reliability, we introduce a weighted multi-model voting strategy that aggregates results from multiple LLMs to determine the suspiciousness of each code line. Extensive experimental results demonstrate that FLAME outperforms state-of-the-art fault localization baselines on programming assignments, successfully localizing 207 more faults at top-1 over the best-performing baseline. Beyond educational contexts, FLAME also generalizes effectively to general-purpose software codebases, outperforming all baselines on the Defects4J benchmark.
中文: FLAME提出了一种细粒度、可解释的故障定位方法,通过LLM引导的注释和模型集成,精确识别错误代码行并提供详细解释,在教育和通用软件场景中均优于现有技术。
English: FLAME introduces a fine-grained, explainable fault localization method using LLM-guided annotation and model ensemble to precisely identify faulty code lines with detailed explanations, outperforming existing techniques in both educational and general software contexts.

Authors:Jason Holmes, Yuexing Hao, Mariana Borras-Osorio, Federico Mastroleo, Santiago Romero Brufau, Valentina Carducci, Katie M Van Abel, David M Routman, Andrew Y. K. Foong, Liv M Muller, Satomi Shiraishi, Daniel K Ebner, Daniel J Ma, Sameer R Keole, Samir H Patel, Mirek Fatyga, Martin Bues, Brad J Stish, Yolanda I Garces, Michelle A Neben Wittich, Robert L Foote, Sujay A Vora, Nadia N Laack, Mark R Waddle, Wei Liu
Title: RadOnc-GPT: An Autonomous LLM Agent for Real-Time Patient Outcomes Labeling at Scale
Abstract:
Manual labeling limits the scale, accuracy, and timeliness of patient outcomes research in radiation oncology. We present RadOnc-GPT, an autonomous large language model (LLM)-based agent capable of independently retrieving patient-specific information, iteratively assessing evidence, and returning structured outcomes. Our evaluation explicitly validates RadOnc-GPT across two clearly defined tiers of increasing complexity: (1) a structured quality assurance (QA) tier, assessing the accurate retrieval of demographic and radiotherapy treatment plan details, followed by (2) a complex clinical outcomes labeling tier involving determination of mandibular osteoradionecrosis (ORN) in head-and-neck cancer patients and detection of cancer recurrence in independent prostate and head-and-neck cancer cohorts requiring combined interpretation of structured and unstructured patient data. The QA tier establishes foundational trust in structured-data retrieval, a critical prerequisite for successful complex clinical outcome labeling.
中文: RadOnc-GPT作为自主人工智能代理,通过精确检索和分析患者数据克服人工标注局限,其有效性已在结构化质量保证和复杂临床结果评估中得到验证。
English: RadOnc-GPT is an autonomous AI agent that overcomes manual labeling limitations by accurately retrieving and analyzing patient data, validated through structured QA and complex clinical outcome assessments.

Authors:Chi Zhang, Zehua Chen, Kaiwen Zheng, Jun Zhu
Title: VoiceBridge: Designing Latent Bridge Models for General Speech Restoration at Scale
Abstract:
Bridge models have recently been explored for speech enhancement tasks such as denoising, dereverberation, and super-resolution, while these efforts are typically confined to a single task or small-scale datasets, with constrained general speech restoration (GSR) capability at scale. In this work, we introduce VoiceBridge, a GSR system rooted in latent bridge models (LBMs), capable of reconstructing high-fidelity speech at full-band (\textit{i.e.,} 48~kHz) from various distortions. By compressing speech waveform into continuous latent representations, VoiceBridge models the~\textit{diverse LQ-to-HQ tasks} (namely, low-quality to high-quality) in GSR with~\textit{a single latent-to-latent generative process} backed by a scalable transformer architecture. To better inherit the advantages of bridge models from the data domain to the latent space, we present an energy-preserving variational autoencoder, enhancing the alignment between the waveform and latent space over varying energy levels. Furthermore, to address the difficulty of HQ reconstruction from distinctively different LQ priors, we propose a joint neural prior, uniformly alleviating the reconstruction burden of LBM. At last, considering the key requirement of GSR systems, human perceptual quality, a perceptually aware fine-tuning stage is designed to mitigate the cascading mismatch in generation while improving perceptual alignment. Extensive validation across in-domain and out-of-domain tasks and datasets (\textit{e.g.}, refining recent zero-shot speech and podcast generation results) demonstrates the superior performance of VoiceBridge. Demo samples can be visited at: https://VoiceBridge-demo.github.io/.
中文摘要:VoiceBridge是一种基于潜在桥模型的通用语音恢复系统,通过统一的潜在生成过程从多种失真中重建高保真全频带语音,并采用能量保持编码和感知微调来提升性能。
English Summary: VoiceBridge is a general speech restoration system using latent bridge models that reconstructs high-fidelity full-band speech from various distortions through a unified generative process, enhanced by energy-preserving encoding and perceptual fine-tuning.

Authors:Tianle Wang, Sirui Zhang, Xinyi Tong, Peiyang Yu, Jishang Chen, Liangke Zhao, Xinpu Gao, Yves Zhu, Tiezheng Ge, Bo Zheng, Duo Xu, Yang Liu, Xin Jin, Feng Yu, Songchun Zhu
Title: Discovering "Words" in Music: Unsupervised Learning of Compositional Sparse Code for Symbolic Music
Abstract:
This paper presents an unsupervised machine learning algorithm that identifies recurring patterns -- referred to as ``music-words'' -- from symbolic music data. These patterns are fundamental to musical structure and reflect the cognitive processes involved in composition. However, extracting these patterns remains challenging because of the inherent semantic ambiguity in musical interpretation. We formulate the task of music-word discovery as a statistical optimization problem and propose a two-stage Expectation-Maximization (EM)-based learning framework: 1. Developing a music-word dictionary; 2. Reconstructing the music data. When evaluated against human expert annotations, the algorithm achieved an Intersection over Union (IoU) score of 0.61. Our findings indicate that minimizing code length effectively addresses semantic ambiguity, suggesting that human optimization of encoding systems shapes musical semantics. This approach enables computers to extract ``basic building blocks'' from music data, facilitating structural analysis and sparse encoding. The method has two primary applications. First, in AI music, it supports downstream tasks such as music generation, classification, style transfer, and improvisation. Second, in musicology, it provides a tool for analyzing compositional patterns and offers insights into the principle of minimal encoding across diverse musical styles and composers.
中文: 本文提出了一种无监督机器学习算法,通过两阶段期望最大化框架从符号音乐数据中发现"音乐词汇",在人类标注评估中获得0.61的交并比分数,证明最小化编码长度能有效解决语义模糊问题,并为AI音乐和音乐学领域提供了应用支持。
English: This paper introduces an unsupervised machine learning algorithm that discovers "music-words" from symbolic music data through a two-stage EM framework, achieving a 0.61 IoU score against human annotations and demonstrating that minimizing code length resolves semantic ambiguity while enabling applications in AI music and musicology.

Authors:Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen
Title: LLaDA-MoE: A Sparse MoE Diffusion Language Model
Abstract:
We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.
中文摘要:LLaDA-MoE是一种采用混合专家架构的新型扩散语言模型,在推理时仅激活14亿参数即可实现最先进性能,在显著降低计算成本的同时达到了与更大模型相当的能力水平。
English Summary: LLaDA-MoE is a novel diffusion language model using Mixture-of-Experts architecture that achieves state-of-the-art performance with only 1.4B active parameters during inference, matching larger models' capabilities while significantly reducing computational costs.

Authors:Yang Ye, Tianyu He, Shuo Yang, Jiang Bian
Title: Reinforcement Learning with Inverse Rewards for World Model Post-training
Abstract:
World models simulate dynamic environments, enabling agents to interact with diverse input modalities. Although recent advances have improved the visual quality and temporal consistency of video world models, their ability of accurately modeling human-specified actions remains under-explored. Reinforcement learning presents a promising approach for directly improving the suboptimal action-following capability of pre-trained models, assuming that an appropriate reward function can be defined. However, transferring reinforcement learning post-training methods to world model is impractical due to the prohibitive cost of large-scale preference annotations and the infeasibility of constructing rule-based video verifiers. To address this gap, we propose Reinforcement Learning with Inverse Rewards (RLIR), a post-training framework that derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model. By mapping high-dimensional video modality to a low-dimensional action space, RLIR provides an objective and verifiable reward for optimization via Group Relative Policy Optimization. Experiments across autoregressive and diffusion paradigms demonstrate 5-10% gains in action-following, up to 10% improvements in visual quality, and higher human preference scores, establishing RLIR as the first post-training method specifically designed to enhance action-following in video world models.
Chinese: 提出的逆向奖励强化学习(RLIR)框架通过逆向动力学模型生成可验证的奖励信号,显著提升了视频世界模型的动作跟随能力和视觉质量。
English: The proposed Reinforcement Learning with Inverse Rewards (RLIR) framework enhances action-following in video world models by deriving verifiable reward signals through an Inverse Dynamics Model, achieving significant improvements in both action accuracy and visual quality.

Authors:Herve Goeau, Alexis Joly, Pierre Bonnet, Souheil Selmi, Jean-Francois Molino, Daniel Barthelemy, Nozha Boujemaa
Title: LifeCLEF Plant Identification Task 2014
Abstract:
The LifeCLEFs plant identification task provides a testbed for a system-oriented evaluation of plant identification about 500 species trees and herbaceous plants. Seven types of image content are considered: scan and scan-like pictures of leaf, and 6 kinds of detailed views with unconstrained conditions, directly photographed on the plant: flower, fruit, stem & bark, branch, leaf and entire view. The main originality of this data is that it was specifically built through a citizen sciences initiative conducted by Tela Botanica, a French social network of amateur and expert botanists. This makes the task closer to the conditions of a real-world application. This overview presents more precisely the resources and assessments of task, summarizes the retrieval approaches employed by the participating groups, and provides an analysis of the main evaluation results. With a total of ten groups from six countries and with a total of twenty seven submitted runs, involving distinct and original methods, this fourth year task confirms Image & Multimedia Retrieval community interest for biodiversity and botany, and highlights further challenging studies in plant identification.
中文:LifeCLEF植物识别任务通过公民科学项目收集的多样化图像,评估对500种植物的识别系统,既体现了实际应用条件,也推动了生物多样性研究领域的社区参与。
English: The LifeCLEF plant identification task evaluates systems for recognizing 500 plant species using diverse images collected through a citizen science initiative, reflecting real-world conditions and fostering community interest in biodiversity research.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: LifeCLEF Plant Identification Task 2015
Abstract:
The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2015 evaluation was actually conducted on a set of more than 100K images illustrating 1000 plant species living in West Europe. The main originality of this dataset is that it was built through a large-scale participatory sensing plateform initiated in 2011 and which now involves tens of thousands of contributors. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
中文: LifeCLEF 2015植物识别挑战通过自2011年起数万贡献者参与的众包平台,利用10万余张西欧千种植物图像评估大规模识别方法,并分析了资源、方案与成果。
English: The LifeCLEF 2015 plant identification challenge evaluated large-scale methods using over 100,000 images of 1,000 Western European species, collected through a participatory platform involving thousands of contributors since 2011, analyzing resources, approaches, and outcomes.

Authors:Junyou Wang, Zehua Chen, Binjie Yuan, Kaiwen Zheng, Chang Li, Yuxuan Jiang, Jun Zhu
Title: AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
Abstract:
Guidance methods have demonstrated significant improvements in cross-modal audio generation, including text-to-audio (T2A) and video-to-audio (V2A) generation. The popularly adopted method, classifier-free guidance (CFG), steers generation by emphasizing condition alignment, enhancing fidelity but often at the cost of diversity. Recently, autoguidance (AG) has been explored for audio generation, encouraging the sampling to faithfully reconstruct the target distribution and showing increased diversity. Despite these advances, they usually rely on a single guiding principle, e.g., condition alignment in CFG or score accuracy in AG, leaving the full potential of guidance for audio generation untapped. In this work, we explore enriching the composition of the guidance method and present a mixture-of-guidance framework, AudioMoG. Within the design space, AudioMoG can exploit the complementary advantages of distinctive guiding principles by fulfilling their cumulative benefits. With a reduced form, AudioMoG can consider parallel complements or recover a single guiding principle, without sacrificing generality. We experimentally show that, given the same inference speed, AudioMoG approach consistently outperforms single guidance in T2A generation across sampling steps, concurrently showing advantages in V2A, text-to-music, and image generation. These results highlight a "free lunch" in current cross-modal audio generation systems: higher quality can be achieved through mixed guiding principles at the sampling stage without sacrificing inference efficiency. Demo samples are available at: https://audio-mog.github.io.
中文:AudioMoG提出了一种混合引导框架,通过结合多种指导原则来增强跨模态音频生成,在不牺牲推理效率的情况下实现了更高质量。
English: AudioMoG introduces a mixture-of-guidance framework that combines multiple guiding principles to enhance cross-modal audio generation, achieving higher quality without compromising inference efficiency.

Authors:Xinrong Yang, Peizhuo Li, Hongyi Li, Junkai Lu, Linnan Chang, Yuhong Cao, Yifeng Zhang, Ge Sun, Guillaume Sartoretti
Title: HeLoM: Hierarchical Learning for Whole-Body Loco-Manipulation in Hexapod Robot
Abstract:
Robots in real-world environments are often required to move/manipulate objects comparable in weight to their own bodies. Compared to grasping and carrying, pushing provides a more straightforward and efficient non-prehensile manipulation strategy, avoiding complex grasp design while leveraging direct contact to regulate an object's pose. Achieving effective pushing, however, demands both sufficient manipulation forces and the ability to maintain stability, which is particularly challenging when dealing with heavy or irregular objects. To address these challenges, we propose HeLoM, a learning-based hierarchical whole-body manipulation framework for a hexapod robot that exploits coordinated multi-limb control. Inspired by the cooperative strategies of multi-legged insects, our framework leverages redundant contact points and high degrees of freedom to enable dynamic redistribution of contact forces. HeLoM's high-level planner plans pushing behaviors and target object poses, while its low-level controller maintains locomotion stability and generates dynamically consistent joint actions. Our policies trained in simulation are directly deployed on real robots without additional fine-tuning. This design allows the robot to maintain balance while exerting continuous and controllable pushing forces through coordinated foreleg interaction and supportive hind-leg propulsion. We validate the effectiveness of HeLoM through both simulation and real-world experiments. Results show that our framework can stably push boxes of varying sizes and unknown physical properties to designated goal poses in the real world.
中文: HeLoM框架通过分层规划与控制,协调六足机器人多肢力量并保持平衡,实现了对重型物体的稳定推动。
English: The HeLoM framework enables a hexapod robot to stably push heavy objects by coordinating multi-limb forces and maintaining balance through hierarchical planning and control.

Authors:Peizhuo Li, Hongyi Li, Yuxuan Ma, Linnan Chang, Xinrong Yang, Ruiqi Yu, Yifeng Zhang, Yuhong Cao, Qiuguo Zhu, Guillaume Sartoretti
Title: KiVi: Kinesthetic-Visuospatial Integration for Dynamic and Safe Egocentric Legged Locomotion
Abstract:
Vision-based locomotion has shown great promise in enabling legged robots to perceive and adapt to complex environments. However, visual information is inherently fragile, being vulnerable to occlusions, reflections, and lighting changes, which often cause instability in locomotion. Inspired by animal sensorimotor integration, we propose KiVi, a Kinesthetic-Visuospatial integration framework, where kinesthetics encodes proprioceptive sensing of body motion and visuospatial reasoning captures visual perception of surrounding terrain. Specifically, KiVi separates these pathways, leveraging proprioception as a stable backbone while selectively incorporating vision for terrain awareness and obstacle avoidance. This modality-balanced, yet integrative design, combined with memory-enhanced attention, allows the robot to robustly interpret visual cues while maintaining fallback stability through proprioception. Extensive experiments show that our method enables quadruped robots to stably traverse diverse terrains and operate reliably in unstructured outdoor environments, remaining robust to out-of-distribution (OOD) visual noise and occlusion unseen during training, thereby highlighting its effectiveness and applicability to real-world legged locomotion.
中文: KiVi框架通过整合动觉与视觉空间通路,使腿式机器人能够稳健地穿越复杂地形,其中动觉提供稳定的本体感知基础,而视觉则选择性用于地形感知和避障,展现出对现实环境中视觉干扰和遮挡的强大适应能力。
English: The KiVi framework integrates kinesthetic and visuospatial pathways to enable legged robots to robustly navigate complex terrains by using proprioception as a stable backbone while selectively incorporating vision for terrain awareness, demonstrating resilience to visual disturbances and occlusions in real-world environments.

Authors:Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang
Title: Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers
Abstract:
Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its effectiveness beyond that of simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce Mirror-Critique, a framework that trains a verifier with informative critiques. Our key insight is to leverage the rich critique signal by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model to synthesize high-quality critique data with rejection sampling that teaches the verifier not only what is wrong, but also why. The synthetic data is used to cold-start the LLMs in the RLVR process to further improve the verification ability. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution, aggregating them into a verify score used for weighted voting or selective abstention. The experimental results show that our Mirror-Verifier significantly outperforms majority voting in terms of solution accuracy and also improves the solver's honesty to recognize and abstain from answering beyond its capability boundaries.
中文: 通过引入Mirror-Critique框架,利用模型生成解与真实解的对比产生信息丰富的评判信号来训练验证器,显著提升了测试时扩展中解决方案的准确性和模型诚实度,优于多数投票方法。
English: Test-time scaling using solution sampling and aggregation is enhanced by Mirror-Critique, a framework that trains verifiers with informative critiques derived from contrasting model-generated and ground-truth solutions, significantly improving reasoning accuracy and model honesty over majority voting.

Authors:Ping Chen, Xiang Liu, Zhaoxiang Liu, Zezhou Chen, Xingpeng Zhang, Huan Hu, Zipeng Wang, Kai Wang, Shuming Shi, Shiguo Lian
Title: Fuzzy Reasoning Chain (FRC): An Innovative Reasoning Framework from Fuzziness to Clarity
Abstract:
With the rapid advancement of large language models (LLMs), natural language processing (NLP) has achieved remarkable progress. Nonetheless, significant challenges remain in handling texts with ambiguity, polysemy, or uncertainty. We introduce the Fuzzy Reasoning Chain (FRC) framework, which integrates LLM semantic priors with continuous fuzzy membership degrees, creating an explicit interaction between probability-based reasoning and fuzzy membership reasoning. This transition allows ambiguous inputs to be gradually transformed into clear and interpretable decisions while capturing conflicting or uncertain signals that traditional probability-based methods cannot. We validate FRC on sentiment analysis tasks, where both theoretical analysis and empirical results show that it ensures stable reasoning and facilitates knowledge transfer across different model scales. These findings indicate that FRC provides a general mechanism for managing subtle and ambiguous expressions with improved interpretability and robustness.
中文:模糊推理链(FRC)框架融合了大语言模型先验与模糊隶属度,将模糊输入逐步转化为清晰可解释的决策,在情感分析任务中展现出更优的稳定性和知识迁移能力。
English: The Fuzzy Reasoning Chain (FRC) framework combines large language model priors with fuzzy membership to transform ambiguous inputs into clear decisions, enhancing interpretability and robustness in sentiment analysis tasks.

Authors:Kaike Zhang, Xiaobei Wang, Shuchang Liu, Hailan Yang, Xiang Li, Lantao Hu, Han Li, Qi Cao, Fei Sun, Kun Gai
Title: GoalRank: Group-Relative Optimization for a Large Ranking Model
Abstract:
Mainstream ranking approaches typically follow a Generator-Evaluator two-stage paradigm, where a generator produces candidate lists and an evaluator selects the best one. Recent work has attempted to enhance performance by expanding the number of candidate lists, for example, through multi-generator settings. However, ranking involves selecting a recommendation list from a combinatorially large space. Simply enlarging the candidate set remains ineffective, and performance gains quickly saturate. At the same time, recent advances in large recommendation models have shown that end-to-end one-stage models can achieve promising performance with the expectation of scaling laws. Motivated by this, we revisit ranking from a generator-only one-stage perspective. We theoretically prove that, for any (finite Multi-)Generator-Evaluator model, there always exists a generator-only model that achieves strictly smaller approximation error to the optimal ranking policy, while also enjoying scaling laws as its size increases. Building on this result, we derive an evidence upper bound of the one-stage optimization objective, from which we find that one can leverage a reward model trained on real user feedback to construct a reference policy in a group-relative manner. This reference policy serves as a practical surrogate of the optimal policy, enabling effective training of a large generator-only ranker. Based on these insights, we propose GoalRank, a generator-only ranking framework. Extensive offline experiments on public benchmarks and large-scale online A/B tests demonstrate that GoalRank consistently outperforms state-of-the-art methods.
中文摘要:主流排序方法通常采用生成器-评估器两阶段范式,但我们从理论上证明单阶段生成器模型能获得更小的近似误差并受益于规模效应,据此提出的GoalRank框架在实验中显著优于现有最优方法。
English Summary: Mainstream ranking methods use a two-stage Generator-Evaluator approach, but we demonstrate that a generator-only one-stage model can achieve better performance with smaller approximation error and scaling law benefits, leading to our proposed GoalRank framework that outperforms existing methods.

Authors:Minfeng Zhu, Zi Wang, Sizhe Ji, Zhengtong Du, Junming Ke, Xiao Deng, Zanlang Yin, Xiuqi Huang, Heyu Wang, Wei Chen
Title: GenesisGeo: Technical Report
Abstract:
We present GenesisGeo, an automated theorem prover in Euclidean geometry. We have open-sourced a large-scale geometry dataset of 21.8 million geometric problems, over 3 million of which contain auxiliary constructions. Specially, we significantly accelerate the symbolic deduction engine DDARN by 120x through theorem matching, combined with a C++ implementation of its core components. Furthermore, we build our neuro-symbolic prover, GenesisGeo, upon Qwen3-0.6B-Base, which solves 24 of 30 problems (IMO silver medal level) in the IMO-AG-30 benchmark using a single model, and achieves 26 problems (IMO gold medal level) with a dual-model ensemble.
中文:GenesisGeo是一款自动化几何定理证明器,通过定理匹配和C++核心优化将符号推理引擎提速120倍,并基于Qwen3-0.6B模型在IMO-AG-30基准测试中以双模型集成解决26题,达到IMO金牌水平。
English: GenesisGeo is an automated theorem prover that accelerates symbolic deduction by 120x through theorem matching and C++ optimization, achieving IMO gold medal-level performance by solving 26 of 30 problems with a dual-model ensemble.

Authors:Jinbang Huang, Zhiyuan Li, Zhanguang Zhang, Xingyue Quan, Jianye Hao, Yingxue Zhang
Title: Plan2Evolve: LLM Self-Evolution for Improved Planning Capability via Automated Domain Generation
Abstract:
Large Language Models (LLMs) have recently shown strong potential in robotic task planning, particularly through automatic planning domain generation that integrates symbolic search. Prior approaches, however, have largely treated these domains as search utilities, with limited attention to their potential as scalable sources of reasoning data. At the same time, progress in reasoning LLMs has been driven by chain-of-thought (CoT) supervision, whose application in robotics remains dependent on costly, human-curated datasets. We propose Plan2Evolve, an LLM self-evolving framework in which the base model generates planning domains that serve as engines for producing symbolic problem-plan pairs as reasoning traces. These pairs are then transformed into extended CoT trajectories by the same model through natural-language explanations, thereby explicitly aligning symbolic planning structures with natural language reasoning. The resulting data extend beyond the model's intrinsic planning capacity, enabling model fine-tuning that yields a planning-enhanced LLM with improved planning success, stronger cross-task generalization, and reduced inference costs.
Chinese Summary: Plan2Evolve是一种自演进的LLM框架,通过生成规划领域来产生符号化问题-规划对,并将其转化为思维链轨迹进行模型微调,从而提升规划成功率与泛化能力,同时降低推理成本。
English Summary: Plan2Evolve is a self-evolving LLM framework that generates planning domains to produce symbolic problem-plan pairs, which are transformed into chain-of-thought trajectories for model fine-tuning, enhancing planning success and generalization while reducing inference costs.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Overview of ExpertLifeCLEF 2018: how far automated identification systems are from the best experts?
Abstract:
Automated identification of plants and animals has improved considerably in the last few years, in particular thanks to the recent advances in deep learning. The next big question is how far such automated systems are from the human expertise. Indeed, even the best experts are sometimes confused and/or disagree between each others when validating visual or audio observations of living organism. A picture actually contains only a partial information that is usually not sufficient to determine the right species with certainty. Quantifying this uncertainty and comparing it to the performance of automated systems is of high interest for both computer scientists and expert naturalists. The LifeCLEF 2018 ExpertCLEF challenge presented in this paper was designed to allow this comparison between human experts and automated systems. In total, 19 deep-learning systems implemented by 4 different research teams were evaluated with regard to 9 expert botanists of the French flora. The main outcome of this work is that the performance of state-of-the-art deep learning models is now close to the most advanced human expertise. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
中文:深度学习驱动的动植物自动识别系统性能已接近人类专家水平,LifeCLEF 2018专家挑战赛通过对比4个团队的19个AI系统与9位植物学家的表现证实了这一点。
English: Automated plant and animal identification systems using deep learning are now approaching the performance level of human experts, as demonstrated by the LifeCLEF 2018 ExpertCLEF challenge that compared 19 AI systems against nine botanists.

Authors:Jiakai Tang, Yujie Luo, Xunke Xi, Fei Sun, Xueyang Feng, Sunhao Dai, Chao Yi, Dian Chen, Zhujin Gao, Yang Li, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, Bo Zheng
Title: Interactive Recommendation Agent with Active User Commands
Abstract:
Traditional recommender systems rely on passive feedback mechanisms that limit users to simple choices such as like and dislike. However, these coarse-grained signals fail to capture users' nuanced behavior motivations and intentions. In turn, current systems cannot also distinguish which specific item attributes drive user satisfaction or dissatisfaction, resulting in inaccurate preference modeling. These fundamental limitations create a persistent gap between user intentions and system interpretations, ultimately undermining user satisfaction and harming system effectiveness. To address these limitations, we introduce the Interactive Recommendation Feed (IRF), a pioneering paradigm that enables natural language commands within mainstream recommendation feeds. Unlike traditional systems that confine users to passive implicit behavioral influence, IRF empowers active explicit control over recommendation policies through real-time linguistic commands. To support this paradigm, we develop RecBot, a dual-agent architecture where a Parser Agent transforms linguistic expressions into structured preferences and a Planner Agent dynamically orchestrates adaptive tool chains for on-the-fly policy adjustment. To enable practical deployment, we employ simulation-augmented knowledge distillation to achieve efficient performance while maintaining strong reasoning capabilities. Through extensive offline and long-term online experiments, RecBot shows significant improvements in both user satisfaction and business outcomes.
中文: 传统推荐系统依赖被动反馈机制,难以捕捉用户细粒度偏好,而提出的交互式推荐信息流(IRF)通过其RecBot双智能体架构实现了基于自然语言的主动控制,借助实时策略调整显著提升了用户满意度和系统效果。
English: Traditional recommender systems' reliance on passive feedback limits nuanced user preference capture, but the proposed Interactive Recommendation Feed (IRF) with its RecBot dual-agent architecture enables active linguistic control, significantly improving satisfaction and outcomes through real-time policy adjustments.

Authors:Jiakai Tang, Yujie Luo, Xunke Xi, Fei Sun, Xueyang Feng, Sunhao Dai, Chao Yi, Dian Chen, Zhujin Gao, Yang Li, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, Bo Zheng
Title: Interactive Recommendation Agent with Active User Commands
Abstract:
Traditional recommender systems rely on passive feedback mechanisms that limit users to simple choices such as like and dislike. However, these coarse-grained signals fail to capture users' nuanced behavior motivations and intentions. In turn, current systems cannot also distinguish which specific item attributes drive user satisfaction or dissatisfaction, resulting in inaccurate preference modeling. These fundamental limitations create a persistent gap between user intentions and system interpretations, ultimately undermining user satisfaction and harming system effectiveness. To address these limitations, we introduce the Interactive Recommendation Feed (IRF), a pioneering paradigm that enables natural language commands within mainstream recommendation feeds. Unlike traditional systems that confine users to passive implicit behavioral influence, IRF empowers active explicit control over recommendation policies through real-time linguistic commands. To support this paradigm, we develop RecBot, a dual-agent architecture where a Parser Agent transforms linguistic expressions into structured preferences and a Planner Agent dynamically orchestrates adaptive tool chains for on-the-fly policy adjustment. To enable practical deployment, we employ simulation-augmented knowledge distillation to achieve efficient performance while maintaining strong reasoning capabilities. Through extensive offline and long-term online experiments, RecBot shows significant improvements in both user satisfaction and business outcomes.
中文: 传统推荐系统依赖被动反馈机制,难以捕捉用户细粒度偏好,而提出的交互式推荐信息流(IRF)通过其RecBot双智能体架构实现了基于自然语言的主动控制,借助实时策略调整显著提升了用户满意度和系统效果。
English: Traditional recommender systems' reliance on passive feedback limits nuanced user preference capture, but the proposed Interactive Recommendation Feed (IRF) with its RecBot dual-agent architecture enables active linguistic control, significantly improving satisfaction and outcomes through real-time policy adjustments.

Authors:Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
Title: FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Abstract:
The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
中文: 该研究发现针对多模态大语言模型的视觉越狱攻击因局限于高锐度区域而对参数变化极为敏感,并提出FORCE方法通过纠正特征过度依赖来提升跨模型可迁移性。
English: The study reveals that visual jailbreaking attacks on multimodal language models are highly sensitive to parameter changes due to their confinement in high-sharpness regions, and introduces the FORCE method to enhance cross-model transferability by correcting feature over-reliance.

Authors:Runqi Lin, Alasdair Paren, Suqin Yuan, Muyang Li, Philip Torr, Adel Bibi, Tongliang Liu
Title: FORCE: Transferable Visual Jailbreaking Attacks via Feature Over-Reliance CorrEction
Abstract:
The integration of new modalities enhances the capabilities of multimodal large language models (MLLMs) but also introduces additional vulnerabilities. In particular, simple visual jailbreaking attacks can manipulate open-source MLLMs more readily than sophisticated textual attacks. However, these underdeveloped attacks exhibit extremely limited cross-model transferability, failing to reliably identify vulnerabilities in closed-source MLLMs. In this work, we analyse the loss landscape of these jailbreaking attacks and find that the generated attacks tend to reside in high-sharpness regions, whose effectiveness is highly sensitive to even minor parameter changes during transfer. To further explain the high-sharpness localisations, we analyse their feature representations in both the intermediate layers and the spectral domain, revealing an improper reliance on narrow layer representations and semantically poor frequency components. Building on this, we propose a Feature Over-Reliance CorrEction (FORCE) method, which guides the attack to explore broader feasible regions across layer features and rescales the influence of frequency features according to their semantic content. By eliminating non-generalizable reliance on both layer and spectral features, our method discovers flattened feasible regions for visual jailbreaking attacks, thereby improving cross-model transferability. Extensive experiments demonstrate that our approach effectively facilitates visual red-teaming evaluations against closed-source MLLMs.
中文: 该研究发现针对多模态大语言模型的视觉越狱攻击因局限于高锐度区域而对参数变化极为敏感,并提出FORCE方法通过纠正特征过度依赖来提升跨模型可迁移性。
English: The study reveals that visual jailbreaking attacks on multimodal language models are highly sensitive to parameter changes due to their confinement in high-sharpness regions, and introduces the FORCE method to enhance cross-model transferability by correcting feature over-reliance.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Plant identification in an open-world (LifeCLEF 2016)
Abstract:
The LifeCLEF plant identification challenge aims at evaluating plant identification methods and systems at a very large scale, close to the conditions of a real-world biodiversity monitoring scenario. The 2016-th edition was actually conducted on a set of more than 110K images illustrating 1000 plant species living in West Europe, built through a large-scale participatory sensing platform initiated in 2011 and which now involves tens of thousands of contributors. The main novelty over the previous years is that the identification task was evaluated as an open-set recognition problem, i.e. a problem in which the recognition system has to be robust to unknown and never seen categories. Beyond the brute-force classification across the known classes of the training set, the big challenge was thus to automatically reject the false positive classification hits that are caused by the unknown classes. This overview presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
中文:LifeCLEF 2016植物识别挑战赛利用超过11万张西欧千种植物的图像评估大规模识别方法,重点研究开放集识别以处理未知类别并排除误判,同时分析了参赛方案和主要成果。
English: The LifeCLEF 2016 plant identification challenge evaluated large-scale methods using over 110,000 images of 1,000 Western European species, focusing on open-set recognition to handle unknown categories and reject false positives, while analyzing participant approaches and outcomes.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Plant identification based on noisy web data: the amazing performance of deep learning (LifeCLEF 2017)
Abstract:
The 2017-th edition of the LifeCLEF plant identification challenge is an important milestone towards automated plant identification systems working at the scale of continental floras with 10.000 plant species living mainly in Europe and North America illustrated by a total of 1.1M images. Nowadays, such ambitious systems are enabled thanks to the conjunction of the dazzling recent progress in image classification with deep learning and several outstanding international initiatives, such as the Encyclopedia of Life (EOL), aggregating the visual knowledge on plant species coming from the main national botany institutes. However, despite all these efforts the majority of the plant species still remain without pictures or are poorly illustrated. Outside the institutional channels, a much larger number of plant pictures are available and spread on the web through botanist blogs, plant lovers web-pages, image hosting websites and on-line plant retailers. The LifeCLEF 2017 plant challenge presented in this paper aimed at evaluating to what extent a large noisy training dataset collected through the web and containing a lot of labelling errors can compete with a smaller but trusted training dataset checked by experts. To fairly compare both training strategies, the test dataset was created from a third data source, i.e. the Pl@ntNet mobile application that collects millions of plant image queries all over the world. This paper presents more precisely the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
Chinese: LifeCLEF 2017植物识别挑战旨在评估大规模网络收集的含噪数据集能否与小型专家验证数据集相竞争,通过Pl@ntNet应用的测试集对两种训练策略进行公平比较。
English: The LifeCLEF 2017 plant identification challenge evaluates whether a large but noisy web-collected dataset can compete with a smaller, expert-verified dataset, using a test set from the Pl@ntNet app to compare both strategies.

Authors:Junjie Cui, Peilong Wang, Jason Holmes, Leshan Sun, Michael L. Hinni, Barbara A. Pockaj, Sujay A. Vora, Terence T. Sio, William W. Wong, Nathan Y. Yu, Steven E. Schild, Joshua R. Niska, Sameer R. Keole, Jean-Claude M. Rwigema, Samir H. Patel, Lisa A. McGee, Carlos A. Vargas, Wei Liu
Title: An Automated Retrieval-Augmented Generation LLaMA-4 109B-based System for Evaluating Radiotherapy Treatment Plans
Abstract:
Purpose: To develop a retrieval-augmented generation (RAG) system powered by LLaMA-4 109B for automated, protocol-aware, and interpretable evaluation of radiotherapy treatment plans. Methods and Materials: We curated a multi-protocol dataset of 614 radiotherapy plans across four disease sites and constructed a knowledge base containing normalized dose metrics and protocol-defined constraints. The RAG system integrates three core modules: a retrieval engine optimized across five SentenceTransformer backbones, a percentile prediction component based on cohort similarity, and a clinical constraint checker. These tools are directed by a large language model (LLM) using a multi-step prompt-driven reasoning pipeline to produce concise, grounded evaluations. Results: Retrieval hyperparameters were optimized using Gaussian Process on a scalarized loss function combining root mean squared error (RMSE), mean absolute error (MAE), and clinically motivated accuracy thresholds. The best configuration, based on all-MiniLM-L6-v2, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and a sub-2pt MAE. When tested end-to-end, the RAG system achieved 100% agreement with the computed values by standalone retrieval and constraint-checking modules on both percentile estimates and constraint identification, confirming reliable execution of all retrieval, prediction and checking steps. Conclusion: Our findings highlight the feasibility of combining structured population-based scoring with modular tool-augmented reasoning for transparent, scalable plan evaluation in radiation therapy. The system offers traceable outputs, minimizes hallucination, and demonstrates robustness across protocols. Future directions include clinician-led validation, and improved domain-adapted retrieval models to enhance real-world integration.
中文摘要:本研究开发了基于LLaMA-4 109B的检索增强生成系统,用于自动化评估放疗计划,在百分位预测和临床约束验证方面实现高精度,同时确保结果可解释。
English Summary: This study developed a retrieval-augmented generation system using LLaMA-4 109B to automate radiotherapy plan evaluation, achieving high accuracy in percentile predictions and clinical constraint verification while ensuring interpretable outputs.

Authors:Junjie Cui, Peilong Wang, Jason Holmes, Leshan Sun, Michael L. Hinni, Barbara A. Pockaj, Sujay A. Vora, Terence T. Sio, William W. Wong, Nathan Y. Yu, Steven E. Schild, Joshua R. Niska, Sameer R. Keole, Jean-Claude M. Rwigema, Samir H. Patel, Lisa A. McGee, Carlos A. Vargas, Wei Liu
Title: An Automated Retrieval-Augmented Generation LLaMA-4 109B-based System for Evaluating Radiotherapy Treatment Plans
Abstract:
Purpose: To develop a retrieval-augmented generation (RAG) system powered by LLaMA-4 109B for automated, protocol-aware, and interpretable evaluation of radiotherapy treatment plans. Methods and Materials: We curated a multi-protocol dataset of 614 radiotherapy plans across four disease sites and constructed a knowledge base containing normalized dose metrics and protocol-defined constraints. The RAG system integrates three core modules: a retrieval engine optimized across five SentenceTransformer backbones, a percentile prediction component based on cohort similarity, and a clinical constraint checker. These tools are directed by a large language model (LLM) using a multi-step prompt-driven reasoning pipeline to produce concise, grounded evaluations. Results: Retrieval hyperparameters were optimized using Gaussian Process on a scalarized loss function combining root mean squared error (RMSE), mean absolute error (MAE), and clinically motivated accuracy thresholds. The best configuration, based on all-MiniLM-L6-v2, achieved perfect nearest-neighbor accuracy within a 5-percentile-point margin and a sub-2pt MAE. When tested end-to-end, the RAG system achieved 100% agreement with the computed values by standalone retrieval and constraint-checking modules on both percentile estimates and constraint identification, confirming reliable execution of all retrieval, prediction and checking steps. Conclusion: Our findings highlight the feasibility of combining structured population-based scoring with modular tool-augmented reasoning for transparent, scalable plan evaluation in radiation therapy. The system offers traceable outputs, minimizes hallucination, and demonstrates robustness across protocols. Future directions include clinician-led validation, and improved domain-adapted retrieval models to enhance real-world integration.
中文摘要:本研究开发了基于LLaMA-4 109B的检索增强生成系统,用于自动化评估放疗计划,在百分位预测和临床约束验证方面实现高精度,同时确保结果可解释。
English Summary: This study developed a retrieval-augmented generation system using LLaMA-4 109B to automate radiotherapy plan evaluation, achieving high accuracy in percentile predictions and clinical constraint verification while ensuring interpretable outputs.

Authors:Elias N. Zois, Moises Diaz, Salem Said, Miguel A. Ferrer
Title: Quasi-Synthetic Riemannian Data Generation for Writer-Independent Offline Signature Verification
Abstract:
Offline handwritten signature verification remains a challenging task, particularly in writer-independent settings where models must generalize across unseen individuals. Recent developments have highlighted the advantage of geometrically inspired representations, such as covariance descriptors on Riemannian manifolds. However, past or present, handcrafted or data-driven methods usually depend on real-world signature datasets for classifier training. We introduce a quasi-synthetic data generation framework leveraging the Riemannian geometry of Symmetric Positive Definite matrices (SPD). A small set of genuine samples in the SPD space is the seed to a Riemannian Gaussian Mixture which identifies Riemannian centers as synthetic writers and variances as their properties. Riemannian Gaussian sampling on each center generates positive as well as negative synthetic SPD populations. A metric learning framework utilizes pairs of similar and dissimilar SPD points, subsequently testing it over on real-world datasets. Experiments conducted on two popular signature datasets, encompassing Western and Asian writing styles, demonstrate the efficacy of the proposed approach under both intra- and cross- dataset evaluation protocols. The results indicate that our quasi-synthetic approach achieves low error rates, highlighting the potential of generating synthetic data in Riemannian spaces for writer-independent signature verification systems.
Chinese: 本文提出了一种利用黎曼几何生成准合成数据的框架,通过在对称正定矩阵空间创建合成签名样本,在跨数据集的独立书写者离线手写签名验证中实现了低错误率。
English: This paper proposes a quasi-synthetic data generation framework using Riemannian geometry to create synthetic signature samples, which achieves low error rates in writer-independent offline handwritten signature verification across different datasets.

Authors:Chenhao Ji, Chaohui Yu, Junyao Gao, Fan Wang, Cairong Zhao
Title: CamPVG: Camera-Controlled Panoramic Video Generation with Epipolar-Aware Diffusion
Abstract:
Recently, camera-controlled video generation has seen rapid development, offering more precise control over video generation. However, existing methods predominantly focus on camera control in perspective projection video generation, while geometrically consistent panoramic video generation remains challenging. This limitation is primarily due to the inherent complexities in panoramic pose representation and spherical projection. To address this issue, we propose CamPVG, the first diffusion-based framework for panoramic video generation guided by precise camera poses. We achieve camera position encoding for panoramic images and cross-view feature aggregation based on spherical projection. Specifically, we propose a panoramic Plücker embedding that encodes camera extrinsic parameters through spherical coordinate transformation. This pose encoder effectively captures panoramic geometry, overcoming the limitations of traditional methods when applied to equirectangular projections. Additionally, we introduce a spherical epipolar module that enforces geometric constraints through adaptive attention masking along epipolar lines. This module enables fine-grained cross-view feature aggregation, substantially enhancing the quality and consistency of generated panoramic videos. Extensive experiments demonstrate that our method generates high-quality panoramic videos consistent with camera trajectories, far surpassing existing methods in panoramic video generation.
中文摘要:提出的CamPVG框架首次实现了基于扩散模型的精确相机位姿引导全景视频生成,通过球面坐标编码和极线约束突破传统方法局限,显著提升了生成视频的质量与几何一致性。
English Summary: The proposed CamPVG framework introduces a diffusion-based approach for panoramic video generation using precise camera pose guidance, overcoming challenges in panoramic geometry through novel spherical coordinate encoding and epipolar constraints to produce superior quality videos.

Authors:Yang Cui, Peter Pan, Lei He, Sheng Zhao
Title: Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
Abstract:
With the rapid advancement of speech generative models, unauthorized voice cloning poses significant privacy and security risks. Speech watermarking offers a viable solution for tracing sources and preventing misuse. Current watermarking technologies fall mainly into two categories: DSP-based methods and deep learning-based methods. DSP-based methods are efficient but vulnerable to attacks, whereas deep learning-based methods offer robust protection at the expense of significantly higher computational cost. To improve the computational efficiency and enhance the robustness, we propose PKDMark, a lightweight deep learning-based speech watermarking method that leverages progressive knowledge distillation (PKD). Our approach proceeds in two stages: (1) training a high-performance teacher model using an invertible neural network-based architecture, and (2) transferring the teacher's capabilities to a compact student model through progressive knowledge distillation. This process reduces computational costs by 93.6% while maintaining high level of robust performance and imperceptibility. Experimental results demonstrate that our distilled model achieves an average detection F1 score of 99.6% with a PESQ of 4.30 in advanced distortions, enabling efficient speech watermarking for real-time speech synthesis applications.
Chinese: 针对现有语音水印技术易受攻击和高计算成本的问题,我们提出PKDMark——一种基于渐进式知识蒸馏的轻量级深度学习语音水印方法,在保持强鲁棒性和不可感知性的同时,将计算成本降低93.6%,适用于实时语音合成场景。
English: To address the vulnerabilities and high computational costs of existing speech watermarking methods, we propose PKDMark, a lightweight deep learning-based approach using progressive knowledge distillation that reduces computational costs by 93.6% while maintaining robust performance and imperceptibility for real-time applications.

Authors:Yang Cui, Peter Pan, Lei He, Sheng Zhao
Title: Efficient Speech Watermarking for Speech Synthesis via Progressive Knowledge Distillation
Abstract:
With the rapid advancement of speech generative models, unauthorized voice cloning poses significant privacy and security risks. Speech watermarking offers a viable solution for tracing sources and preventing misuse. Current watermarking technologies fall mainly into two categories: DSP-based methods and deep learning-based methods. DSP-based methods are efficient but vulnerable to attacks, whereas deep learning-based methods offer robust protection at the expense of significantly higher computational cost. To improve the computational efficiency and enhance the robustness, we propose PKDMark, a lightweight deep learning-based speech watermarking method that leverages progressive knowledge distillation (PKD). Our approach proceeds in two stages: (1) training a high-performance teacher model using an invertible neural network-based architecture, and (2) transferring the teacher's capabilities to a compact student model through progressive knowledge distillation. This process reduces computational costs by 93.6% while maintaining high level of robust performance and imperceptibility. Experimental results demonstrate that our distilled model achieves an average detection F1 score of 99.6% with a PESQ of 4.30 in advanced distortions, enabling efficient speech watermarking for real-time speech synthesis applications.
Chinese: 针对现有语音水印技术易受攻击和高计算成本的问题,我们提出PKDMark——一种基于渐进式知识蒸馏的轻量级深度学习语音水印方法,在保持强鲁棒性和不可感知性的同时,将计算成本降低93.6%,适用于实时语音合成场景。
English: To address the vulnerabilities and high computational costs of existing speech watermarking methods, we propose PKDMark, a lightweight deep learning-based approach using progressive knowledge distillation that reduces computational costs by 93.6% while maintaining robust performance and imperceptibility for real-time applications.

Authors:Pei Zhang, Andong Chen, Xi Chen, Baosong Yang, Derek F. Wong, Fei Huang
Title: PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs
Abstract:
Large language models (LLMs) have expanded from text to speech, giving rise to Speech Large Models (SLMs) that support recognition, translation, and synthesis. A key challenge is aligning speech and text representations, which becomes harder in multilingual settings. Existing methods often freeze LLM parameters and train encoders on multilingual data, but this forces cross-language convergence and limits performance. We introduce Progressive Alignment Representation Training (PART), a multi-stage and multi-task framework that separates within-language from cross-language alignment. During cross-language training, LLM parameters are dynamically activated, and text-based tasks are later introduced to enhance multilingual understanding. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches, with analysis confirming its ability to balance language-specific distinctions and cross-language generalization. These results demonstrate PART's effectiveness and generality for multilingual speech modality alignment.
中文摘要:PART通过多阶段训练框架分别处理语内与跨语言对齐,并动态激活大语言模型参数,在多个数据集上超越了传统方法,有效平衡了语言特性与跨语言泛化能力。
English Summary: PART is a multi-stage framework that improves multilingual speech-text alignment by separating within-language and cross-language training while dynamically activating LLM parameters, outperforming conventional methods across multiple datasets.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Overview of LifeCLEF Plant Identification task 2020
Abstract:
Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data with more and more photos in the field. However, this profusion of data only concerns a few tens of thousands of species, mostly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have collected, catalogued and systematically stored plant specimens in herbaria, particularly in tropical regions, and the recent efforts by the biodiversity informatics community made it possible to put millions of digitized sheets online. The LifeCLEF 2020 Plant Identification challenge (or "PlantCLEF 2020") was designed to evaluate to what extent automated identification on the flora of data deficient regions can be improved by the use of herbarium collections. It is based on a dataset of about 1,000 species mainly focused on the South America's Guiana Shield, an area known to have one of the greatest diversity of plants in the world. The challenge was evaluated as a cross-domain classification task where the training set consist of several hundred thousand herbarium sheets and few thousand of photos to enable learning a mapping between the two domains. The test set was exclusively composed of photos in the field. This paper presents the resources and assessments of the conducted evaluation, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
中文: LifeCLEF 2020植物识别挑战赛旨在通过利用数百万数字化植物标本进行训练,提升对生物多样性丰富但数据匮乏地区的植物自动识别能力,测试集仅包含野外照片以评估跨领域分类效果。
English: The LifeCLEF 2020 Plant Identification challenge aimed to enhance automated plant recognition in biodiverse but data-scarce regions by utilizing millions of digitized herbarium specimens for training, with the test set consisting solely of field photos to evaluate cross-domain classification performance.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Overview of LifeCLEF Plant Identification task 2019: diving into data deficient tropical countries
Abstract:
Automated identification of plants has improved considerably thanks to the recent progress in deep learning and the availability of training data. However, this profusion of data only concerns a few tens of thousands of species, while the planet has nearly 369K. The LifeCLEF 2019 Plant Identification challenge (or "PlantCLEF 2019") was designed to evaluate automated identification on the flora of data deficient regions. It is based on a dataset of 10K species mainly focused on the Guiana shield and the Northern Amazon rainforest, an area known to have one of the greatest diversity of plants and animals in the world. As in the previous edition, a comparison of the performance of the systems evaluated with the best tropical flora experts was carried out. This paper presents the resources and assessments of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of the main outcomes.
中文: 深度学习进展提升了植物自动识别能力,但地球近36.9万物种中大多仍缺乏数据,为此LifeCLEF 2019竞赛聚焦主亚那地盾的1万种植物评估AI系统,并将结果与顶尖植物学家进行对比分析。
English: Deep learning advancements have enhanced automated plant identification, yet data scarcity persists for most of Earth's 369,000 species, prompting the LifeCLEF 2019 challenge to evaluate AI performance on 10,000 Guiana Shield species while comparing results with expert botanists.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Overview of PlantCLEF 2021: cross-domain plant identification
Abstract:
Automated plant identification has improved considerably thanks to recent advances in deep learning and the availability of training data with more and more field photos. However, this profusion of data concerns only a few tens of thousands of species, mainly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have systematically collected, catalogued and stored plant specimens in herbaria, especially in tropical regions, and recent efforts by the biodiversity informatics community have made it possible to put millions of digitised records online. The LifeCLEF 2021 plant identification challenge (or "PlantCLEF 2021") was designed to assess the extent to which automated identification of flora in data-poor regions can be improved by using herbarium collections. It is based on a dataset of about 1,000 species mainly focused on the Guiana Shield of South America, a region known to have one of the highest plant diversities in the world. The challenge was evaluated as a cross-domain classification task where the training set consisted of several hundred thousand herbarium sheets and a few thousand photos to allow learning a correspondence between the two domains. In addition to the usual metadata (location, date, author, taxonomy), the training data also includes the values of 5 morphological and functional traits for each species. The test set consisted exclusively of photos taken in the field. This article presents the resources and evaluations of the assessment carried out, summarises the approaches and systems used by the participating research groups and provides an analysis of the main results.
中文: 尽管深度学习推动了植物自动识别的发展,但生物多样性丰富地区的数据仍显不足,因此LifeCLEF 2021竞赛利用植物标本馆藏品,旨在提升如圭亚那地盾等区域的物种识别能力。
English: Automated plant identification has advanced through deep learning, yet data scarcity in biodiverse regions persists, prompting the LifeCLEF 2021 challenge to leverage herbarium collections for improving species recognition in areas like the Guiana Shield.

Authors:Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang
Title: Sparse Training Scheme for Multimodal LLM
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient due to the significantly longer input sequences introduced by multimodal data and the low utilization of inter-layer computations. To address this challenge, we shift the focus to the training process itself and propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). This scheme consists of two key components: the Visual Token Compressor, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by dynamically skipping unnecessary layers in the language model during both forward and backward passes. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.
中文: 稀疏训练方案通过压缩视觉标记和动态跳过不必要的语言模型层,显著提高了多模态大语言模型的训练效率,并在多种架构和基准测试中验证了其有效性。
English: The Sparse Training Scheme (STS) enhances the efficiency of training Multimodal Large Language Models by compressing visual tokens and dynamically skipping unnecessary layers, proving effective across various architectures and benchmarks.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Overview of PlantCLEF 2022: Image-based plant identification at global scale
Abstract:
It is estimated that there are more than 300,000 species of vascular plants in the world. Increasing our knowledge of these species is of paramount importance for the development of human civilization (agriculture, construction, pharmacopoeia, etc.), especially in the context of the biodiversity crisis. However, the burden of systematic plant identification by human experts strongly penalizes the aggregation of new data and knowledge. Since then, automatic identification has made considerable progress in recent years as highlighted during all previous editions of PlantCLEF. Deep learning techniques now seem mature enough to address the ultimate but realistic problem of global identification of plant biodiversity in spite of many problems that the data may present (a huge number of classes, very strongly unbalanced classes, partially erroneous identifications, duplications, variable visual quality, diversity of visual contents such as photos or herbarium sheets, etc). The PlantCLEF2022 challenge edition proposes to take a step in this direction by tackling a multi-image (and metadata) classification problem with a very large number of classes (80k plant species). This paper presents the resources and evaluations of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of key findings.
中文: PlantCLEF2022挑战赛通过采用深度学习技术处理8万种植物的多图像和元数据分类,旨在解决植物多样性识别中的数据难题,推动自动识别技术发展以应对生物多样性危机。
English: The PlantCLEF2022 challenge addresses the critical need for automated plant species identification by leveraging deep learning to classify 80,000 species using multi-image and metadata, overcoming data challenges to advance biodiversity knowledge.

Authors:Herve Goeau, Pierre Bonnet, Alexis Joly
Title: Overview of PlantCLEF 2023: Image-based Plant Identification at Global Scale
Abstract:
The world is estimated to be home to over 300,000 species of vascular plants. In the face of the ongoing biodiversity crisis, expanding our understanding of these species is crucial for the advancement of human civilization, encompassing areas such as agriculture, construction, and pharmacopoeia. However, the labor-intensive process of plant identification undertaken by human experts poses a significant obstacle to the accumulation of new data and knowledge. Fortunately, recent advancements in automatic identification, particularly through the application of deep learning techniques, have shown promising progress. Despite challenges posed by data-related issues such as a vast number of classes, imbalanced class distribution, erroneous identifications, duplications, variable visual quality, and diverse visual contents (such as photos or herbarium sheets), deep learning approaches have reached a level of maturity which gives us hope that in the near future we will have an identification system capable of accurately identifying all plant species worldwide. The PlantCLEF2023 challenge aims to contribute to this pursuit by addressing a multi-image (and metadata) classification problem involving an extensive set of classes (80,000 plant species). This paper provides an overview of the challenge's resources and evaluations, summarizes the methods and systems employed by participating research groups, and presents an analysis of key findings.
Chinese: 面对全球丰富的维管植物多样性,深度学习技术通过应对数据挑战为物种识别提供了可行路径,PlantCLEF2023竞赛正是推动实现全球植物精准识别的重要实践。
English: The world's vast diversity of vascular plants necessitates advanced identification methods, with deep learning offering promising solutions to overcome data challenges and achieve global species recognition, as demonstrated by the PlantCLEF2023 challenge.

Authors:Giulio Martellucci, Herve Goeau, Pierre Bonnet, Fabrice Vinatier, Alexis Joly
Title: Overview of PlantCLEF 2025: Multi-Species Plant Identification in Vegetation Quadrat Images
Abstract:
Quadrat images are essential for ecological studies, as they enable standardized sampling, the assessment of plant biodiversity, long-term monitoring, and large-scale field campaigns. These images typically cover an area of fifty centimetres or one square meter, and botanists carefully identify all the species present. Integrating AI could help specialists accelerate their inventories and expand the spatial coverage of ecological studies. To assess progress in this area, the PlantCLEF 2025 challenge relies on a new test set of 2,105 high-resolution multi-label images annotated by experts and covering around 400 species. It also provides a large training set of 1.4 million individual plant images, along with vision transformer models pre-trained on this data. The task is formulated as a (weakly labelled) multi-label classification problem, where the goal is to predict all species present in a quadrat image using single-label training data. This paper provides a detailed description of the data, the evaluation methodology, the methods and models used by participants, and the results achieved.
中文: PlantCLEF 2025挑战赛通过提供大规模数据集和预训练视觉Transformer模型,推进人工智能在生态研究中的应用,实现样方图像中多标签物种识别,从而加速植物清查并扩大研究范围。
English: The PlantCLEF 2025 challenge introduces a large dataset and pre-trained vision transformer models to advance AI-assisted ecological studies by enabling multi-label species identification in quadrat images, thereby accelerating plant inventories and expanding research coverage.

Authors:Junhao Jia, Yunyou Liu, Yifei Sun, Huangwei Chen, Feiwei Qin, Changmiao Wang, Yong Peng
Title: Geodesic Prototype Matching via Diffusion Maps for Interpretable Fine-Grained Recognition
Abstract:
Nonlinear manifolds are widespread in deep visual features, where Euclidean distances often fail to capture true similarity. This limitation becomes particularly severe in prototype-based interpretable fine-grained recognition, where subtle semantic distinctions are essential. To address this challenge, we propose a novel paradigm for prototype-based recognition that anchors similarity within the intrinsic geometry of deep features. Specifically, we distill the latent manifold structure of each class into a diffusion space and introduce a differentiable Nyström interpolation, making the geometry accessible to both unseen samples and learnable prototypes. To ensure efficiency, we employ compact per-class landmark sets with periodic updates. This design keeps the embedding aligned with the evolving backbone, enabling fast and scalable inference. Extensive experiments on the CUB-200-2011 and Stanford Cars datasets show that our GeoProto framework produces prototypes focusing on semantically aligned parts, significantly outperforming Euclidean prototype networks.
Chinese: GeoProto框架通过利用扩散空间和可微分Nyström插值的内在流形几何,解决了深度视觉特征中欧氏距离的局限性,在细粒度识别任务中以语义对齐的原型显著提升了性能。
English: The GeoProto framework addresses the limitations of Euclidean distances in deep visual features by leveraging intrinsic manifold geometry through diffusion spaces and differentiable Nyström interpolation, achieving superior performance in fine-grained recognition with semantically aligned prototypes.

Authors:Yongrui Chen, Yi Huang, Yunchang Liu, Shenyu Zhang, Junhao He, Tongtong Wu, Guilin Qi, Tianxing Wu
Title: K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling
Abstract:
Continual Structured Knowledge Reasoning (CSKR) focuses on training models to handle sequential tasks, where each task involves translating natural language questions into structured queries grounded in structured knowledge. Existing general continual learning approaches face significant challenges when applied to this task, including poor generalization to heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase. To address these limitations, we propose a novel CSKR framework, \textsc{K-DeCore}, which operates with a fixed number of tunable parameters. Unlike prior methods, \textsc{K-DeCore} introduces a knowledge decoupling mechanism that disentangles the reasoning process into task-specific and task-agnostic stages, effectively bridging the gaps across diverse tasks. Building on this foundation, \textsc{K-DeCore} integrates a dual-perspective memory consolidation mechanism for distinct stages and introduces a structure-guided pseudo-data synthesis strategy to further enhance the model's generalization capabilities. Extensive experiments on four benchmark datasets demonstrate the superiority of \textsc{K-DeCore} over existing continual learning methods across multiple metrics, leveraging various backbone large language models.
中文: 提出的\textsc{K-DeCore}框架通过将推理解耦为任务特定和任务无关阶段并采用固定参数,解决了持续结构化知识推理中的关键挑战,在多个基准测试中展现出卓越性能。
English: The proposed \textsc{K-DeCore} framework addresses challenges in Continual Structured Knowledge Reasoning by decoupling reasoning into task-specific and task-agnostic stages with fixed parameters, demonstrating superior performance across multiple benchmarks.

Authors:Yongrui Chen, Yi Huang, Yunchang Liu, Shenyu Zhang, Junhao He, Tongtong Wu, Guilin Qi, Tianxing Wu
Title: K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling
Abstract:
Continual Structured Knowledge Reasoning (CSKR) focuses on training models to handle sequential tasks, where each task involves translating natural language questions into structured queries grounded in structured knowledge. Existing general continual learning approaches face significant challenges when applied to this task, including poor generalization to heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase. To address these limitations, we propose a novel CSKR framework, \textsc{K-DeCore}, which operates with a fixed number of tunable parameters. Unlike prior methods, \textsc{K-DeCore} introduces a knowledge decoupling mechanism that disentangles the reasoning process into task-specific and task-agnostic stages, effectively bridging the gaps across diverse tasks. Building on this foundation, \textsc{K-DeCore} integrates a dual-perspective memory consolidation mechanism for distinct stages and introduces a structure-guided pseudo-data synthesis strategy to further enhance the model's generalization capabilities. Extensive experiments on four benchmark datasets demonstrate the superiority of \textsc{K-DeCore} over existing continual learning methods across multiple metrics, leveraging various backbone large language models.
中文: 提出的\textsc{K-DeCore}框架通过将推理解耦为任务特定和任务无关阶段并采用固定参数,解决了持续结构化知识推理中的关键挑战,在多个基准测试中展现出卓越性能。
English: The proposed \textsc{K-DeCore} framework addresses challenges in Continual Structured Knowledge Reasoning by decoupling reasoning into task-specific and task-agnostic stages with fixed parameters, demonstrating superior performance across multiple benchmarks.

Authors:Qiang Xiang, Shuang Sun, Binglei Li, Dejia Song, Huaxia Li, Nemo Chen, Xu Tang, Yao Hu, Junping Zhang
Title: InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention
Abstract:
Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaption to existing DiT-based T2I models through light-weighted LoRA modules. Additionally, we propose a Layout-to-Image benchmark, Denselayout, a comprehensive benchmark for layout-to-image generation, containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules.
中文:提出的InstanceAssemble架构通过实例组装注意力和轻量级LoRA模块整合布局条件,在实现最优性能的同时引入了新基准和评估指标,显著提升了布局到图像的生成效果。
English: The proposed InstanceAssemble architecture enhances Layout-to-Image generation by integrating layout conditions through instance-assembling attention and lightweight LoRA modules, achieving state-of-the-art performance while introducing a new benchmark and evaluation metric.

Authors:Yue Ma, Zexuan Yan, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Zhifeng Li, Wei Liu, Linfeng Zhang, Qifeng Chen
Title: Follow-Your-Emoji-Faster: Towards Efficient, Fine-Controllable, and Expressive Freestyle Portrait Animation
Abstract:
We present Follow-Your-Emoji-Faster, an efficient diffusion-based framework for freestyle portrait animation driven by facial landmarks. The main challenges in this task are preserving the identity of the reference portrait, accurately transferring target expressions, and maintaining long-term temporal consistency while ensuring generation efficiency. To address identity preservation and accurate expression retargeting, we enhance Stable Diffusion with two key components: a expression-aware landmarks as explicit motion signals, which improve motion alignment, support exaggerated expressions, and reduce identity leakage; and a fine-grained facial loss that leverages both expression and facial masks to better capture subtle expressions and faithfully preserve the reference appearance. With these components, our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. However, diffusion-based frameworks typically struggle to efficiently generate long-term stable animation results, which remains a core challenge in this task. To address this, we propose a progressive generation strategy for stable long-term animation, and introduce a Taylor-interpolated cache, achieving a 2.6X lossless acceleration. These two strategies ensure that our method produces high-quality results efficiently, making it user-friendly and accessible. Finally, we introduce EmojiBench++, a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences. Extensive evaluations on EmojiBench++ demonstrate that Follow-Your-Emoji-Faster achieves superior performance in both animation quality and controllability. The code, training dataset and benchmark will be found in https://follow-your-emoji.github.io/.
中文: 我们提出Follow-Your-Emoji-Faster框架,通过增强运动信号和面部损失,在保持身份特征和表情精度的同时,高效生成多样化肖像动画,并实现无损加速,展现出卓越的动画质量与控制性。
English: We introduce Follow-Your-Emoji-Faster, an efficient diffusion-based framework that uses facial landmarks to animate diverse portraits while preserving identity and expression accuracy through enhanced motion signals and facial loss, achieving high-quality results with accelerated performance.

Authors:Tianyi Yan, Wencheng Han, Xia Zhou, Xueyang Zhang, Kun Zhan, Cheng-zhong Xu, Jianbing Shen
Title: RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation
Abstract:
Synthetic data is crucial for advancing autonomous driving (AD) systems, yet current state-of-the-art video generation models, despite their visual realism, suffer from subtle geometric distortions that limit their utility for downstream perception tasks. We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. To address this, we introduce Reinforcement Learning with Geometric Feedback (RLGF), RLGF uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. Its core components include an efficient Latent-Space Windowing Optimization technique for targeted feedback during diffusion, and a Hierarchical Geometric Reward (HGR) system providing multi-level rewards for point-line-plane alignment, and scene occupancy coherence. To quantify these distortions, we propose GeoScores. Applied to models like DiVE on nuScenes, RLGF substantially reduces geometric errors (e.g., VP error by 21\%, Depth error by 57\%) and dramatically improves 3D object detection mAP by 12.7\%, narrowing the gap to real-data performance. RLGF offers a plug-and-play solution for generating geometrically sound and reliable synthetic videos for AD development.
中文: 本研究提出带几何反馈的强化学习(RLGF),通过分层奖励机制修正自动驾驶合成视频中的几何失真,显著提升3D物体检测精度,有效缩小与真实数据间的性能差距。
English: The study introduces Reinforcement Learning with Geometric Feedback (RLGF) to address geometric distortions in synthetic autonomous driving videos, significantly improving 3D object detection accuracy by incorporating hierarchical rewards and reducing performance gaps with real data.

Authors:Rikuto Kotoge, Zheng Chen, Tasuku Kimura, Yasuko Matsubara, Takufumi Yanagisawa, Haruhiko Kishima, Yasushi Sakurai
Title: EvoBrain: Dynamic Multi-channel EEG Graph Modeling for Time-evolving Brain Network
Abstract:
Dynamic GNNs, which integrate temporal and spatial features in Electroencephalography (EEG) data, have shown great potential in automating seizure detection. However, fully capturing the underlying dynamics necessary to represent brain states, such as seizure and non-seizure, remains a non-trivial task and presents two fundamental challenges. First, most existing dynamic GNN methods are built on temporally fixed static graphs, which fail to reflect the evolving nature of brain connectivity during seizure progression. Second, current efforts to jointly model temporal signals and graph structures and, more importantly, their interactions remain nascent, often resulting in inconsistent performance. To address these challenges, we present the first theoretical analysis of these two problems, demonstrating the effectiveness and necessity of explicit dynamic modeling and time-then-graph dynamic GNN method. Building on these insights, we propose EvoBrain, a novel seizure detection model that integrates a two-stream Mamba architecture with a GCN enhanced by Laplacian Positional Encoding, following neurological insights. Moreover, EvoBrain incorporates explicitly dynamic graph structures, allowing both nodes and edges to evolve over time. Our contributions include (a) a theoretical analysis proving the expressivity advantage of explicit dynamic modeling and time-then-graph over other approaches, (b) a novel and efficient model that significantly improves AUROC by 23% and F1 score by 30%, compared with the dynamic GNN baseline, and (c) broad evaluations of our method on the challenging early seizure prediction tasks.
中文: 动态图神经网络在利用脑电图数据自动检测癫痫方面潜力巨大,但难以捕捉大脑连接性的动态演变以及时序信号与图结构的交互作用,因此我们提出了EvoBrain模型,通过显式动态建模和双流Mamba架构显著提升了检测性能。
English: Dynamic GNNs show promise for automated seizure detection from EEG data but face challenges in capturing evolving brain connectivity and modeling interactions between temporal signals and graph structures, leading to the development of EvoBrain, a novel model that significantly improves performance through explicit dynamic modeling and a two-stream Mamba architecture.

Authors:Herve Goeau, Vincent Espitalier, Pierre Bonnet, Alexis Joly
Title: Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images
Abstract:
Plot images are essential for ecological studies, enabling standardized sampling, biodiversity assessment, long-term monitoring and remote, large-scale surveys. Plot images are typically fifty centimetres or one square meter in size, and botanists meticulously identify all the species found there. The integration of AI could significantly improve the efficiency of specialists, helping them to extend the scope and coverage of ecological studies. To evaluate advances in this regard, the PlantCLEF 2024 challenge leverages a new test set of thousands of multi-label images annotated by experts and covering over 800 species. In addition, it provides a large training set of 1.7 million individual plant images as well as state-of-the-art vision transformer models pre-trained on this data. The task is evaluated as a (weakly-labeled) multi-label classification task where the aim is to predict all the plant species present on a high-resolution plot image (using the single-label training data). In this paper, we provide an detailed description of the data, the evaluation methodology, the methods and models employed by the participants and the results achieved.
Chinese: PlantCLEF 2024挑战赛利用人工智能和大规模专家标注的样地图像数据集,推动多标签植物物种分类技术的发展,以提升生态学研究的效率和覆盖范围。
English: The PlantCLEF 2024 challenge utilizes AI and a large dataset of expert-annotated plot images to advance multi-label plant species classification, aiming to enhance the efficiency and scope of ecological studies.

Authors:Ke Wang, Wenning Wei, Yan Deng, Lei He, Sheng Zhao
Title: Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment
Abstract:
Automatic Pronunciation Assessment (APA) is critical for Computer-Assisted Language Learning (CALL), requiring evaluation across multiple granularities and aspects. Large Multimodal Models (LMMs) present new opportunities for APA, but their effectiveness in fine-grained assessment remains uncertain. This work investigates fine-tuning LMMs for APA using the Speechocean762 dataset and a private corpus. Fine-tuning significantly outperforms zero-shot settings and achieves competitive results on single-granularity tasks compared to public and commercial systems. The model performs well at word and sentence levels, while phoneme-level assessment remains challenging. We also observe that the Pearson Correlation Coefficient (PCC) reaches 0.9, whereas Spearman's rank Correlation Coefficient (SCC) remains around 0.6, suggesting that SCC better reflects ordinal consistency. These findings highlight both the promise and limitations of LMMs for APA and point to future work on fine-grained modeling and rank-aware evaluation.
中文: 通过微调大型多模态模型显著提升了自动发音评估性能,在词句层面表现优异但音素评估仍具挑战,同时发现斯皮尔曼相关系数更能反映排序一致性,为未来细粒度建模指明了方向。
English: Fine-tuning Large Multimodal Models significantly improves automatic pronunciation assessment, achieving competitive results at word and sentence levels while highlighting challenges at the phoneme level and the need for rank-aware evaluation metrics.

Authors:Pei Zhang, Yiming Wang, Jialong Tang, Baosong Yang, Rui Wang, Derek F. Wong, Fei Huang
Title: Direct Simultaneous Translation Activation for Large Audio-Language Models
Abstract:
Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time, outputting translations while receiving source speech input, rather than waiting for the entire utterance to be spoken. Simul-S2TT research often modifies model architectures to implement read-write strategies. However, with the rise of large audio-language models (LALMs), a key challenge is how to directly activate Simul-S2TT capabilities in base models without additional architectural changes. In this paper, we introduce {\bf Simul}taneous {\bf S}elf-{\bf A}ugmentation ({\bf SimulSA}), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data by randomly truncating speech and constructing partially aligned translation. By incorporating them into offline SFT data, SimulSA effectively bridges the distribution gap between offline translation during pretraining and simultaneous translation during inference. Experimental results demonstrate that augmenting only about {\bf 1\%} of the simultaneous data, compared to the full offline SFT data, can significantly activate LALMs' Simul-S2TT capabilities without modifications to model architecture or decoding strategy.
中文摘要:SimulSA是一种创新策略,利用大型音频语言模型的内在能力,通过截断语音和构建部分对齐翻译来生成同步语音转文本数据,无需修改模型架构即可有效弥合离线预训练与实时推理之间的分布差距。
English Summary: SimulSA is a novel strategy that leverages large audio-language models' inherent abilities to generate simultaneous speech-to-text translation data by truncating speech and creating partial translations, effectively bridging the gap between offline pretraining and real-time inference without requiring architectural changes.

Authors:Junhao Jia, Yunyou Liu, Cheng Yang, Yifei Sun, Feiwei Qin, Changmiao Wang, Yong Peng
Title: Brain-HGCN: A Hyperbolic Graph Convolutional Network for Brain Functional Network Analysis
Abstract:
Functional magnetic resonance imaging (fMRI) provides a powerful non-invasive window into the brain's functional organization by generating complex functional networks, typically modeled as graphs. These brain networks exhibit a hierarchical topology that is crucial for cognitive processing. However, due to inherent spatial constraints, standard Euclidean GNNs struggle to represent these hierarchical structures without high distortion, limiting their clinical performance. To address this limitation, we propose Brain-HGCN, a geometric deep learning framework based on hyperbolic geometry, which leverages the intrinsic property of negatively curved space to model the brain's network hierarchy with high fidelity. Grounded in the Lorentz model, our model employs a novel hyperbolic graph attention layer with a signed aggregation mechanism to distinctly process excitatory and inhibitory connections, ultimately learning robust graph-level representations via a geometrically sound Fréchet mean for graph readout. Experiments on two large-scale fMRI datasets for psychiatric disorder classification demonstrate that our approach significantly outperforms a wide range of state-of-the-art Euclidean baselines. This work pioneers a new geometric deep learning paradigm for fMRI analysis, highlighting the immense potential of hyperbolic GNNs in the field of computational psychiatry.
中文摘要:提出的Brain-HGCN框架利用双曲几何精确建模fMRI数据中的层次化脑网络,在精神疾病分类任务中显著优于传统欧几里得方法。
English Summary: The proposed Brain-HGCN framework utilizes hyperbolic geometry to accurately model hierarchical brain networks from fMRI data, significantly outperforming Euclidean methods in psychiatric disorder classification.

Authors:Xian Gao, Zongyun Zhang, Ting Liu, Yuzhuo Fu
Title: OnlineMate: An LLM-Based Multi-Agent Companion System for Cognitive Support in Online Learning
Abstract:
In online learning environments, students often lack personalized peer interactions, which play a crucial role in supporting cognitive development and learning engagement. Although previous studies have utilized large language models (LLMs) to simulate interactive dynamic learning environments for students, these interactions remain limited to conversational exchanges, lacking insights and adaptations to the learners' individualized learning and cognitive states. As a result, students' interest in discussions with AI learning companions is low, and they struggle to gain inspiration from such interactions. To address this challenge, we propose OnlineMate, a multi-agent learning companion system driven by LLMs that integrates the Theory of Mind (ToM). OnlineMate is capable of simulating peer-like agent roles, adapting to learners' cognitive states during collaborative discussions, and inferring their psychological states, such as misunderstandings, confusion, or motivation. By incorporating Theory of Mind capabilities, the system can dynamically adjust its interaction strategies to support the development of higher-order thinking and cognition. Experimental results in simulated learning scenarios demonstrate that OnlineMate effectively fosters deep learning and discussions while enhancing cognitive engagement in online educational settings.
OnlineMate是一个基于大语言模型和心理理论的多智能体学习伙伴系统,能动态适应学习者的认知状态,有效提升在线教育中的深度学习和认知参与度。
OnlineMate, a multi-agent system powered by LLMs and Theory of Mind, adapts to learners' cognitive states to enhance engagement and promote deep learning in online education.

Authors:Simiao Liu, Fang Liu, Liehao Li, Xin Tan, Yinghao Zhu, Xiaoli Lian, Li Zhang
Title: An Empirical Study on Failures in Automated Issue Solving
Abstract:
Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic tools show great promise, they still fail on a substantial portion of tasks. Moreover, current evaluations primarily report aggregate issue-solving rates, which obscure the underlying causes of success and failure, making it challenging to diagnose model weaknesses or guide targeted improvements. To bridge this gap, we first analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified under varying task characteristics. Furthermore, to move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. From this analysis, we developed a comprehensive taxonomy of failure modes comprising 3 primary phases, 9 main categories, and 25 fine-grained subcategories. Then we systematically analyze the distribution of the identified failure modes, the results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks. Motivated by these insights, we propose a collaborative Expert-Executor framework. It introduces a supervisory Expert agent tasked with providing strategic oversight and course-correction for a primary Executor agent. This architecture is designed to correct flawed reasoning and break the cognitive deadlocks that frequently lead to failure. Experiments show that our framework solves 22.2% of previously intractable issues for a leading single agent. These findings pave the way for building more robust agents through diagnostic evaluation and collaborative design.
中文总结:本研究分析了自动化问题解决工具在SWE-Bench基准测试中的失败模式,提出了专家-执行者协作框架,通过纠正推理缺陷和打破认知僵局,显著提升了问题解决成功率。
English Summary: This study analyzes failure modes in automated issue-solving tools on the SWE-Bench benchmark and proposes a collaborative Expert-Executor framework that improves success rates by addressing reasoning flaws and cognitive deadlocks.

Authors:Yulan Guo, Longguang Wang, Wendong Mao, Xiaoyu Dong, Yingqian Wang, Li Liu, Wei An
Title: Deep Lookup Network
Abstract:
Convolutional neural networks are constructed with massive operations with different types and are highly computationally intensive. Among these operations, multiplication operation is higher in computational complexity and usually requires {more} energy consumption with longer inference time than other operations, which hinders the deployment of convolutional neural networks on mobile devices. In many resource-limited edge devices, complicated operations can be calculated via lookup tables to reduce computational cost. Motivated by this, in this paper, we introduce a generic and efficient lookup operation which can be used as a basic operation for the construction of neural networks. Instead of calculating the multiplication of weights and activation values, simple yet efficient lookup operations are adopted to compute their responses. To enable end-to-end optimization of the lookup operation, we construct the lookup tables in a differentiable manner and propose several training strategies to promote their convergence. By replacing computationally expensive multiplication operations with our lookup operations, we develop lookup networks for the image classification, image super-resolution, and point cloud classification tasks. It is demonstrated that our lookup networks can benefit from the lookup operations to achieve higher efficiency in terms of energy consumption and inference speed while maintaining competitive performance to vanilla convolutional networks. Extensive experiments show that our lookup networks produce state-of-the-art performance on different tasks (both classification and regression tasks) and different data types (both images and point clouds).
中文: 本文提出一种高效的查表操作来替代卷积神经网络中计算密集的乘法运算,在保持各类任务竞争力的同时,显著降低了能耗并提升了推理速度,适用于资源受限的边缘设备。
English: This paper proposes an efficient lookup operation to replace computationally intensive multiplication in convolutional neural networks, enabling energy-efficient and fast inference on resource-limited devices while maintaining competitive performance across various tasks.

Authors:Kohou Wang, Huan Hu, Xiang Liu, Zezhou Chen, Ping Chen, Zhaoxiang Liu, Shiguo Lian
Title: Hierarchical Deep Fusion Framework for Multi-dimensional Facial Forgery Detection -- The 2024 Global Deepfake Image Detection Challenge
Abstract:
The proliferation of sophisticated deepfake technology poses significant challenges to digital security and authenticity. Detecting these forgeries, especially across a wide spectrum of manipulation techniques, requires robust and generalized models. This paper introduces the Hierarchical Deep Fusion Framework (HDFF), an ensemble-based deep learning architecture designed for high-performance facial forgery detection. Our framework integrates four diverse pre-trained sub-models, Swin-MLP, CoAtNet, EfficientNetV2, and DaViT, which are meticulously fine-tuned through a multi-stage process on the MultiFFDI dataset. By concatenating the feature representations from these specialized models and training a final classifier layer, HDFF effectively leverages their collective strengths. This approach achieved a final score of 0.96852 on the competition's private leaderboard, securing the 20th position out of 184 teams, demonstrating the efficacy of hierarchical fusion for complex image classification tasks.
Chinese: 本文提出的分层深度融合框架(HDFF)通过集成四个预训练子模型实现高效人脸伪造检测,在184支队伍中以0.96852的得分获得第20名,验证了分层融合在复杂图像分类任务中的有效性。
English: This paper presents the Hierarchical Deep Fusion Framework (HDFF), an ensemble deep learning model that integrates four pre-trained sub-models for robust facial forgery detection, achieving 20th place out of 184 teams with a score of 0.96852 in a competition.

Authors:Erblin Isaku, Hassan Sartaj, Shaukat Ali, Beatriz Sanguino, Tongtong Wang, Guoyuan Li, Houxiang Zhang, Thomas Peyrucain
Title: Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins
Abstract:
Self-adaptive robots (SARs) in complex, uncertain environments must proactively detect and address abnormal behaviors, including out-of-distribution (OOD) cases. To this end, digital twins offer a valuable solution for OOD detection. Thus, we present a digital twin-based approach for OOD detection (ODiSAR) in SARs. ODiSAR uses a Transformer-based digital twin to forecast SAR states and employs reconstruction error and Monte Carlo dropout for uncertainty quantification. By combining reconstruction error with predictive variance, the digital twin effectively detects OOD behaviors, even in previously unseen conditions. The digital twin also includes an explainability layer that links potential OOD to specific SAR states, offering insights for self-adaptation. We evaluated ODiSAR by creating digital twins of two industrial robots: one navigating an office environment, and another performing maritime ship navigation. In both cases, ODiSAR forecasts SAR behaviors (i.e., robot trajectories and vessel motion) and proactively detects OOD events. Our results showed that ODiSAR achieved high detection performance -- up to 98\% AUROC, 96\% TNR@TPR95, and 95\% F1-score -- while providing interpretable insights to support self-adaptation.
中文: 本研究提出了ODiSAR方法,利用基于Transformer的数字孪生预测自适应机器人状态,通过重构误差和不确定性量化有效检测分布外行为,在工业机器人测试中实现了高达98%的AUROC检测性能,并提供可解释性分析。
English: The study introduces ODiSAR, a digital twin-based method using Transformer models to forecast self-adaptive robot states and detect out-of-distribution behaviors with high accuracy and explainability, achieving up to 98% AUROC in evaluations on industrial robots.

Authors:Haozhan Ni, Jingsong Liang, Chenyu He, Yuhong Cao, Guillaume Sartoretti
Title: GRATE: a Graph transformer-based deep Reinforcement learning Approach for Time-efficient autonomous robot Exploration
Abstract:
Autonomous robot exploration (ARE) is the process of a robot autonomously navigating and mapping an unknown environment. Recent Reinforcement Learning (RL)-based approaches typically formulate ARE as a sequential decision-making problem defined on a collision-free informative graph. However, these methods often demonstrate limited reasoning ability over graph-structured data. Moreover, due to the insufficient consideration of robot motion, the resulting RL policies are generally optimized to minimize travel distance, while neglecting time efficiency. To overcome these limitations, we propose GRATE, a Deep Reinforcement Learning (DRL)-based approach that leverages a Graph Transformer to effectively capture both local structure patterns and global contextual dependencies of the informative graph, thereby enhancing the model's reasoning capability across the entire environment. In addition, we deploy a Kalman filter to smooth the waypoint outputs, ensuring that the resulting path is kinodynamically feasible for the robot to follow. Experimental results demonstrate that our method exhibits better exploration efficiency (up to 21.5% in distance and 21.3% in time to complete exploration) than state-of-the-art conventional and learning-based baselines in various simulation benchmarks. We also validate our planner in real-world scenarios.
Chinese: GRATE是一种基于深度强化学习的方法,利用图变换器增强对图结构数据的推理能力,并结合卡尔曼滤波器确保运动学可行的路径,在探索任务中相比现有最优方法实现了高达21.5%的距离和21.3%的时间效率提升。
English: GRATE is a Deep Reinforcement Learning approach that uses a Graph Transformer to improve reasoning over graph-structured data and a Kalman filter to ensure kinodynamically feasible paths, achieving up to 21.5% distance and 21.3% time efficiency gains in exploration compared to state-of-the-art methods.

Authors:Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum
Title: LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
Abstract:
The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a ``tennis ball'', or for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on the DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves a new way to editing paradigms.
中文: LazyDrag首次为多模态扩散变换器引入基于拖动的编辑方法,通过消除隐式点匹配并生成显式对应图,实现了无需测试时优化的稳定全强度反转,将精确几何控制与文本引导统一起来,从而完成复杂编辑。
English: LazyDrag introduces the first drag-based editing method for Multi-Modal Diffusion Transformers by eliminating implicit point matching and generating explicit correspondence maps, enabling stable full-strength inversion without test-time optimization and unifying precise geometric control with text guidance for complex edits.

Authors:Kaining Wang, Bo Yang, Zhiwen Yu, Xuelin Cao, Mérouane Debbah, Chau Yuen
Title: Optimization for Massive 3D-RIS Deployment: A Generative Diffusion Model-Based Approach
Abstract:
Reconfigurable Intelligent Surfaces (RISs) transform the wireless environment by modifying the amplitude, phase, and polarization of incoming waves, significantly improving coverage performance. Notably, optimizing the deployment of RISs becomes vital, but existing optimization methods face challenges such as high computational complexity, limited adaptability to changing environments, and a tendency to converge on local optima. In this paper, we propose to optimize the deployment of large-scale 3D RISs using a diffusion model based on probabilistic generative learning. We begin by dividing the target area into fixed grids, with each grid corresponding to a potential deployment location. Then, a multi-RIS deployment optimization problem is formulated, which is difficult to solve directly. By treating RIS deployment as a conditional generation task, the well-trained diffusion model can generate the distribution of deployment strategies, and thus, the optimal deployment strategy can be obtained by sampling from this distribution. Simulation results demonstrate that the proposed diffusion-based method outperforms traditional benchmark approaches in terms of exceed ratio and generalization.
中文摘要:本文提出一种基于扩散模型的大规模三维可重构智能表面部署优化方法,通过将部署问题转化为条件生成任务,有效解决了传统方法计算复杂、适应性差的问题,仿真结果显示其性能优于基准方法。
English Summary: This paper introduces a diffusion model-based approach to optimize the deployment of large-scale 3D Reconfigurable Intelligent Surfaces (RISs), overcoming computational complexity and adaptability limitations of traditional methods by treating deployment as a conditional generation task.

Authors:Zhexi Peng, Kun Zhou, Tianjia Shao
Title: Gaussian-Plus-SDF SLAM: High-fidelity 3D Reconstruction at 150+ fps
Abstract:
While recent Gaussian-based SLAM methods achieve photorealistic reconstruction from RGB-D data, their computational performance remains a critical bottleneck. State-of-the-art techniques operate at less than 20 fps, significantly lagging behind geometry-centric approaches like KinectFusion (hundreds of fps). This limitation stems from the heavy computational burden: modeling scenes requires numerous Gaussians and complex iterative optimization to fit RGB-D data, where insufficient Gaussian counts or optimization iterations cause severe quality degradation. To address this, we propose a Gaussian-SDF hybrid representation, combining a colorized Signed Distance Field (SDF) for smooth geometry and appearance with 3D Gaussians to capture underrepresented details. The SDF is efficiently constructed via RGB-D fusion (as in geometry-centric methods), while Gaussians undergo iterative optimization. Our representation enables drastic Gaussian reduction (50% fewer) by avoiding full-scene Gaussian modeling, and efficient Gaussian optimization (75% fewer iterations) through targeted appearance refinement. Building upon this representation, we develop GPS-SLAM (Gaussian-Plus-SDF SLAM), a real-time 3D reconstruction system achieving over 150 fps on real-world Azure Kinect sequences -- delivering an order-of-magnitude speedup over state-of-the-art techniques while maintaining comparable reconstruction quality. We will release the source code and data to facilitate future research.
中文: GPS-SLAM系统通过结合高斯和SDF表示,在保持重建质量的同时实现了超过150帧/秒的实时三维重建,相比现有技术获得了十倍的速度提升。
English: The proposed GPS-SLAM system combines Gaussian and SDF representations to achieve real-time 3D reconstruction at over 150 fps while maintaining quality, offering a tenfold speed improvement over current methods.

Authors:Qiuhao Liu, Ling Li, Yao Lu, Qi Xuan, Zhaowei Zhu, Jiaheng Wei
Title: SelectMix: Enhancing Label Noise Robustness through Targeted Sample Mixing
Abstract:
Deep neural networks tend to memorize noisy labels, severely degrading their generalization performance. Although Mixup has demonstrated effectiveness in improving generalization and robustness, existing Mixup-based methods typically perform indiscriminate mixing without principled guidance on sample selection and mixing strategy, inadvertently propagating noisy supervision. To overcome these limitations, we propose SelectMix, a confidence-guided mixing framework explicitly tailored for noisy labels. SelectMix first identifies potentially noisy or ambiguous samples through confidence based mismatch analysis using K-fold cross-validation, then selectively blends identified uncertain samples with confidently predicted peers from their potential classes. Furthermore, SelectMix employs soft labels derived from all classes involved in the mixing process, ensuring the labels accurately represent the composition of the mixed samples, thus aligning supervision signals closely with the actual mixed inputs. Through extensive theoretical analysis and empirical evaluations on multiple synthetic (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100) and real-world benchmark datasets (CIFAR-N, MNIST and Clothing1M), we demonstrate that SelectMix consistently outperforms strong baseline methods, validating its effectiveness and robustness in learning with noisy labels.
中文摘要:SelectMix是一种基于置信度指导的混合框架,通过选择性融合不确定样本与可信样本并采用软标签,有效应对噪声标签学习问题,在多个基准测试中均优于现有方法。
English Summary: SelectMix is a confidence-guided framework that selectively blends uncertain samples with confident peers using soft labels to effectively learn from noisy data, consistently outperforming existing methods across multiple benchmarks.

Authors:Giuseppe Silano, Amr Afifi, Martin Saska, Antonio Franchi
Title: STL-Based Motion Planning and Uncertainty-Aware Risk Analysis for Human-Robot Collaboration with a Multi-Rotor Aerial Vehicle
Abstract:
This paper presents a novel approach to motion planning and risk analysis for enhancing human-robot collaboration using a Multi-Rotor Aerial Vehicle (MRAV). The proposed method uses Signal Temporal Logic (STL) to encode key mission objectives, such as safety, timing, and human preferences, with a strong focus on ergonomics and comfort. An optimization framework generates dynamically feasible trajectories while considering the MRAV's physical constraints. Given the nonlinear and non-convex nature of the problem, smooth approximations and gradient-based techniques assist in handling the problem's computational complexity. Additionally, an uncertainty-aware risk analysis is incorporated to assess potential deviations from the mission specifications, providing insights into the likelihood of mission success under uncertain conditions. Further, an event-triggered replanning strategy is implemented to respond to unforeseen events and external disturbances. The approach is validated through MATLAB and Gazebo simulations, using an object handover task in a mock-up environment inspired by power line maintenance scenarios. The results highlight the method's effectiveness in achieving safe, efficient, and resilient human-robot collaboration.
中文摘要:本文提出了一种基于信号时序逻辑和优化框架的无人机运动规划与风险分析方法,通过在电力维护场景仿真验证,实现了安全高效的人机协作。
English Summary: This paper introduces a motion planning and risk analysis method using Signal Temporal Logic and optimization to ensure safe, efficient human-robot collaboration with aerial vehicles, validated through simulations in power line maintenance scenarios.

Authors:Wei Dai, Shengen Wu, Wei Wu, Zhenhao Wang, Sisuo Lyu, Haicheng Liao, Limin Yu, Weiping Ding, Runwei Guan, Yutao Yue
Title: Large Foundation Models for Trajectory Prediction in Autonomous Driving: A Comprehensive Survey
Abstract:
Trajectory prediction serves as a critical functionality in autonomous driving, enabling the anticipation of future motion paths for traffic participants such as vehicles and pedestrians, which is essential for driving safety. Although conventional deep learning methods have improved accuracy, they remain hindered by inherent limitations, including lack of interpretability, heavy reliance on large-scale annotated data, and weak generalization in long-tail scenarios. The rise of Large Foundation Models (LFMs) is transforming the research paradigm of trajectory prediction. This survey offers a systematic review of recent advances in LFMs, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) for trajectory prediction. By integrating linguistic and scene semantics, LFMs facilitate interpretable contextual reasoning, significantly enhancing prediction safety and generalization in complex environments. The article highlights three core methodologies: trajectory-language mapping, multimodal fusion, and constraint-based reasoning. It covers prediction tasks for both vehicles and pedestrians, evaluation metrics, and dataset analyses. Key challenges such as computational latency, data scarcity, and real-world robustness are discussed, along with future research directions including low-latency inference, causality-aware modeling, and motion foundation models.
中文: 轨迹预测对自动驾驶安全至关重要,传统深度学习方法在可解释性和泛化性方面存在局限,而大型基础模型通过语言和多模态融合实现情境推理,正在革新该领域的研究范式。
English: Trajectory prediction is vital for autonomous driving safety, and while traditional deep learning methods face limitations in interpretability and generalization, Large Foundation Models (LFMs) are revolutionizing the field by enabling contextual reasoning through linguistic and multimodal integration.

Authors:Sugyeong Eo, Jungjun Lee, Chanjun Park, Heuiseok Lim
Title: Mixture-of-Clustered-Experts: Advancing Expert Specialization and Generalization in Instruction Tuning
Abstract:
A sparse Mixture-of-Experts (MoE) architecture has emerged as a highly scalable solution by conditionally activating sub-modules without a proportional increase in computational costs. However, improving expert specialization to enhance performance and generalization remains a challenge for MoE, especially in instruction tuning scenarios characterized by significant input heterogeneity. In this work, we propose the Mixture-of-Clustered-Experts (MoCE) to address this limitation through a dual-stage routing mechanism. The first stage in the mechanism performs expert group routing based on sequence-level features, while the second stage activates the top-$k$ experts within the group at the token level. This approach enables the effective partitioning of heterogeneous inputs based on their knowledge requirements, encouraging expert group specialization while maintaining the advantages of token-level routing. We evaluate MoCE across a comprehensive set of benchmarks, demonstrating its consistent superiority over strong baselines and its enhanced generalization capabilities. Detailed analysis further highlights the robustness and effectiveness of MoCE.
Chinese: 混合聚类专家模型(MoCE)通过双阶段路由机制,先基于序列级特征进行专家分组,再在令牌级激活组内顶尖专家,有效划分异构输入以增强专家专业化和泛化能力,同时保持计算效率优势。
English: The Mixture-of-Clustered-Experts (MoCE) introduces a dual-stage routing mechanism that first groups experts based on sequence-level features and then activates top experts at the token level, effectively partitioning heterogeneous inputs to enhance specialization and generalization while maintaining computational efficiency.

Authors:Hongtao Liang, Yihe Diao, YuHang Wu, Fuhui Zhou, Qihui Wu
Title: Synergetic Empowerment: Wireless Communications Meets Embodied Intelligence
Abstract:
Wireless communication is evolving into an agent era, where large-scale agents with inherent embodied intelligence are not just users but active participants. The perfect combination of wireless communication and embodied intelligence can achieve a synergetic empowerment and greatly facilitate the development of agent communication. An overview of this synergetic empowerment is presented, framing it as a co-evolutionary process that transforms wireless communication from a simple utility into the digital nervous system of a collective intelligence, while simultaneously elevating isolated agents into a unified superorganism with emergent capabilities far exceeding individual contributions. Moreover, we elaborate how embodied intelligence and wireless communication mutually benefit each other through the lens of the perception-cognition-execution (PCE) loop, revealing a fundamental duality where each PCE stage both challenges network capacity and creates unprecedented opportunities for system-wide optimization. Furthermore, critical open issues and future research directions are identified.
中文摘要:无线通信正进入智能体时代,通过感知-认知-执行循环实现具身智能与通信系统的协同进化,将分散的智能体融合为具有涌现能力的超级有机体,同时开辟了系统优化的新途径。
English Summary: Wireless communication is advancing into an agent era where embodied intelligence transforms it into a collective digital nervous system, while the perception-cognition-execution loop reveals mutual optimization opportunities between agents and networks.

Authors:Saarth Gaonkar, Xiang Zheng, Haocheng Xi, Rishabh Tiwari, Kurt Keutzer, Dmitriy Morozov, Michael W. Mahoney, Amir Gholami
Title: SciML Agents: Write the Solver, Not the Solution
Abstract:
Recent work in scientific machine learning aims to tackle scientific tasks directly by predicting target values with neural networks (e.g., physics-informed neural networks, neural ODEs, neural operators, etc.), but attaining high accuracy and robustness has been challenging. We explore an alternative view: use LLMs to write code that leverages decades of numerical algorithms. This shifts the burden from learning a solution function to making domain-aware numerical choices. We ask whether LLMs can act as SciML agents that, given a natural-language ODE description, generate runnable code that is scientifically appropriate, selecting suitable solvers (stiff vs. non-stiff), and enforcing stability checks. There is currently no benchmark to measure this kind of capability for scientific computing tasks. As such, we first introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks. The diagnostic set contains problems whose superficial appearance suggests stiffness, and that require algebraic simplification to demonstrate non-stiffness; and the large-scale benchmark spans stiff and non-stiff ODE regimes. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Our evaluation measures both executability and numerical validity against reference solutions. We find that with sufficient context and guided prompts, newer instruction-following models achieve high accuracy on both criteria. In many cases, recent open-source systems perform strongly without fine-tuning, while older or smaller models still benefit from fine-tuning. Overall, our preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
Chinese: 最新研究探索将大型语言模型作为科学机器学习代理,通过生成可执行代码解决微分方程问题,将重点从函数学习转向数值算法选择,新基准测试表明引导式提示能使代码生成和数值有效性达到高精度。
English: Recent research explores using large language models as scientific machine learning agents to generate executable code for solving differential equations, shifting the focus from function learning to numerical algorithm selection, with new benchmarks showing that guided prompting enables high accuracy in code generation and numerical validity.

Authors:Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
Title: Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
Abstract:
Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but face challenges in digitization and understanding, i.e., traditional methods only scan images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap for evaluating VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by a human-aligned large language model for scoring.
中文摘要:该摘要介绍了AncientDoc——首个针对中文古籍的综合性评估基准,通过五项不同任务评估视觉语言模型在古籍处理上的表现,填补了现有技术在复杂古籍数字化与理解方面的空白。
English Summary: This abstract introduces AncientDoc, the first comprehensive benchmark designed to evaluate Vision-Language Models on Chinese ancient documents through five distinct tasks, addressing existing gaps in digitization and understanding of these culturally rich materials.

Authors:Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma, Yajun Xu, Wenjing Zhang, Yibing Nan, Kai Wang, Shiguo Lian
Title: MITS: A Large-Scale Multimodal Benchmark Dataset for Intelligent Traffic Surveillance
Abstract:
General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs, addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS's effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance in ITS applications, increasing LLaVA-1.5's performance from 0.494 to 0.905 (+83.2%), LLaVA-1.6's from 0.678 to 0.921 (+35.8%), Qwen2-VL's from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL's from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open-source, providing high-value resources to advance both ITS and LMM research.
中文: 针对通用大模型在智能交通监控领域的性能局限,MITS作为首个大规模多模态基准数据集被提出,通过微调显著提升了模型性能,并为研究提供了开源资源。
English: To address the limitations of general-domain large multimodal models in intelligent traffic surveillance, the MITS dataset was introduced as the first large-scale multimodal benchmark, significantly enhancing model performance through fine-tuning and providing open-source resources for research.

Authors:Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar
Title: Balancing Utility and Privacy: Dynamically Private SGD with Random Projection
Abstract:
Stochastic optimization is a pivotal enabler in modern machine learning, producing effective models for various tasks. However, several existing works have shown that model parameters and gradient information are susceptible to privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It exhibits provably sub-linear convergence rates across different objective functions, matching the best available rate. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD remarkably enhances accuracy while maintaining privacy. Our code is available here.
中文: D2P2-SGD优化器融合动态差分隐私与自动梯度裁剪及随机投影技术,能动态调整模型效用与隐私的平衡,在多种数据集上实现更高精度并保持可证明的收敛性能。
English: The D2P2-SGD optimizer combines dynamic differential privacy with automatic gradient clipping and random projection to dynamically balance utility and privacy, achieving superior accuracy with provable convergence rates across various datasets.

Authors:Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
Title: OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
Abstract:
Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
中文:OmniEVA通过任务自适应3D定位机制和具身感知推理框架,解决了多模态大语言模型在几何适应性和具身约束方面的不足,在各类具身推理与规划任务中实现了最优性能表现。
English: OmniEVA addresses geometric adaptability and embodiment constraint gaps in multimodal large language models by introducing a task-adaptive 3D grounding mechanism and an embodiment-aware reasoning framework, achieving state-of-the-art performance in embodied reasoning and versatile planning across diverse scenarios.

Authors:Efe Bozkir, Babette Bühler, Xiaoyuan Wu, Enkelejda Kasneci, Lujo Bauer, Lorrie Faith Cranor
Title: The Impact of Device Type, Data Practices, and Use Case Scenarios on Privacy Concerns about Eye-tracked Augmented Reality in the United States and Germany
Abstract:
Augmented reality technology will likely be prevalent with more affordable head-mounted displays. Integrating novel interaction modalities such as eye trackers into head-mounted displays could lead to collecting vast amounts of biometric data, which may allow inference of sensitive user attributes like health status or sexual preference, posing privacy issues. While previous works broadly examined privacy concerns about augmented reality, ours is the first to extensively explore privacy concerns on behavioral data, particularly eye tracking in augmented reality. We crowdsourced four survey studies in the United States (n1 = 48, n2 = 525) and Germany (n3 = 48, n4 = 525) to understand the impact of user attributes, augmented reality devices, use cases, data practices, and country on privacy concerns. Our findings indicate that participants are generally concerned about privacy when they know what inferences can be made based on the collected data. Despite the more prominent use of smartphones in daily life than augmented reality glasses, we found no indications of differing privacy concerns depending on the device type. In addition, our participants are more comfortable when a particular use case benefits them and less comfortable when other humans can consume their data. Furthermore, participants in the United States are less concerned about their privacy than those in Germany. Based on our findings, we provide several recommendations to practitioners and policymakers for privacy-aware augmented reality.
Chinese: 增强现实技术结合眼动追踪可能通过推断敏感用户特征引发隐私风险,研究显示美国和德国参与者了解数据推断后普遍担忧隐私,且舒适度受使用场景和国籍影响,美国参与者比德国参与者更不关注隐私。
English: Augmented reality's integration of eye tracking raises privacy risks by potentially inferring sensitive user traits, with studies in the U.S. and Germany showing heightened concerns when data inferences are understood and varying comfort levels based on use cases and nationality.

Authors:Yao Lu, Chunfeng Sun, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui
Title: \emph{FoQuS}: A Forgetting-Quality Coreset Selection Framework for Automatic Modulation Recognition
Abstract:
Deep learning-based Automatic Modulation Recognition (AMR) model has made significant progress with the support of large-scale labeled data. However, when developing new models or performing hyperparameter tuning, the time and energy consumption associated with repeated training using massive amounts of data are often unbearable. To address the above challenges, we propose \emph{FoQuS}, which approximates the effect of full training by selecting a coreset from the original dataset, thereby significantly reducing training overhead. Specifically, \emph{FoQuS} records the prediction trajectory of each sample during full-dataset training and constructs three importance metrics based on training dynamics. Experiments show that \emph{FoQuS} can maintain high recognition accuracy and good cross-architecture generalization on multiple AMR datasets using only 1\%-30\% of the original data.
FoQuS significantly reduces training overhead for Automatic Modulation Recognition by selecting a coreset based on training dynamics, achieving high accuracy with only 1%-30% of data while maintaining cross-architecture generalization.
English Summary:

Authors:Houjian Yu, Zheming Zhou, Min Sun, Omid Ghasemalizadeh, Yuyin Sun, Cheng-Hao Kuo, Arnie Sen, Changhyun Choi
Title: Attribute-based Object Grounding and Robot Grasp Detection with Spatial Reasoning
Abstract:
Enabling robots to grasp objects specified through natural language is essential for effective human-robot interaction, yet it remains a significant challenge. Existing approaches often struggle with open-form language expressions and typically assume unambiguous target objects without duplicates. Moreover, they frequently rely on costly, dense pixel-wise annotations for both object grounding and grasp configuration. We present Attribute-based Object Grounding and Robotic Grasping (OGRG), a novel framework that interprets open-form language expressions and performs spatial reasoning to ground target objects and predict planar grasp poses, even in scenes containing duplicated object instances. We investigate OGRG in two settings: (1) Referring Grasp Synthesis (RGS) under pixel-wise full supervision, and (2) Referring Grasp Affordance (RGA) using weakly supervised learning with only single-pixel grasp annotations. Key contributions include a bi-directional vision-language fusion module and the integration of depth information to enhance geometric reasoning, improving both grounding and grasping performance. Experiment results show that OGRG outperforms strong baselines in tabletop scenes with diverse spatial language instructions. In RGS, it operates at 17.59 FPS on a single NVIDIA RTX 2080 Ti GPU, enabling potential use in closed-loop or multi-object sequential grasping, while delivering superior grounding and grasp prediction accuracy compared to all the baselines considered. Under the weakly supervised RGA setting, OGRG also surpasses baseline grasp-success rates in both simulation and real-robot trials, underscoring the effectiveness of its spatial reasoning design. Project page: https://z.umn.edu/ogrg
中文: OGRG框架通过双向视觉语言融合和深度增强推理,使机器人能解析开放语言指令以定位目标物体并预测抓取姿态,即使在存在重复物体场景中,也在全监督和弱监督设置下均实现了优越性能。
English: The OGRG framework enables robots to interpret open-form language commands for grounding target objects and predicting grasp poses, even with duplicate instances, achieving superior performance in both fully and weakly supervised settings through bi-directional vision-language fusion and depth-enhanced reasoning.

Authors:Xiaolin Chen, Xuemeng Song, Haokun Wen, Weili Guan, Xiangyu Zhao, Liqiang Nie
Title: Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems
Abstract:
Textual response generation is pivotal for multimodal \mbox{task-oriented} dialog systems, which aims to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) \textit{neglect of unstructured review knowledge} and 2) \textit{underutilization of large language models (LLMs)}. Inspired by this, we aim to fully utilize dual knowledge (\textit{i.e., } structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) \textit{dynamic knowledge type selection} and 2) \textit{intention-response decoupling}. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type's utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the codes and parameters.
中文: 本研究提出DK2R模型,通过结合结构化属性和非结构化评论知识,利用大语言模型增强多模态任务导向对话系统的文本响应生成,有效解决了动态知识类型选择和意图-响应解耦两大挑战。
English: This research introduces DK2R, a dual knowledge-enhanced two-stage reasoner that leverages structured attributes and unstructured reviews with large language models to improve textual response generation in multimodal task-oriented dialog systems, addressing challenges in dynamic knowledge selection and intention-response decoupling.

Authors:Yandi Yang, Jianping Li, Youqi Liao, Yuhao Li, Yizhe Zhang, Zhen Dong, Bisheng Yang, Naser El-Sheimy
Title: Aerial-ground Cross-modal Localization: Dataset, Ground-truth, and Benchmark
Abstract:
Accurate visual localization in dense urban environments poses a fundamental task in photogrammetry, geospatial information science, and robotics. While imagery is a low-cost and widely accessible sensing modality, its effectiveness on visual odometry is often limited by textureless surfaces, severe viewpoint changes, and long-term drift. The growing public availability of airborne laser scanning (ALS) data opens new avenues for scalable and precise visual localization by leveraging ALS as a prior map. However, the potential of ALS-based localization remains underexplored due to three key limitations: (1) the lack of platform-diverse datasets, (2) the absence of reliable ground-truth generation methods applicable to large-scale urban environments, and (3) limited validation of existing Image-to-Point Cloud (I2P) algorithms under aerial-ground cross-platform settings. To overcome these challenges, we introduce a new large-scale dataset that integrates ground-level imagery from mobile mapping systems with ALS point clouds collected in Wuhan, Hong Kong, and San Francisco.
中文摘要:该摘要介绍了一个新的大规模数据集,整合地面影像与机载激光扫描数据,旨在解决城市环境中视觉定位的局限性,如无纹理表面和跨平台验证不足等问题。
English Summary: The abstract introduces a new large-scale dataset combining ground-level imagery and airborne laser scanning data to address limitations in visual localization, such as textureless surfaces and cross-platform validation gaps in urban environments.

Authors:Mpoki Mwaisela, Peterson Yuhala, Pascal Felber, Valerio Schiavoni
Title: IM-PIR: In-Memory Private Information Retrieval
Abstract:
Private information retrieval (PIR) is a cryptographic primitive that allows a client to securely query one or multiple servers without revealing their specific interests. In spite of their strong security guarantees, current PIR constructions are computationally costly. Specifically, most PIR implementations are memory-bound due to the need to scan extensive databases (in the order of GB), making them inherently constrained by the limited memory bandwidth in traditional processor-centric computing architectures.Processing-in-memory (PIM) is an emerging computing paradigm that augments memory with compute capabilities, addressing the memory bandwidth bottleneck while simultaneously providing extensive parallelism.Recent research has demonstrated PIM's potential to significantly improve performance across a range of data-intensive workloads, including graph processing, genome analysis, and machine learning. In this work, we propose the first PIM-based architecture for multi-server PIR. We discuss the algorithmic foundations of the latter and show how its operations align with the core strengths of PIM architectures: extensive parallelism and high memory bandwidth. Based on this observation, we design and implement IM-PIR, a PIM-based multi-server PIR approach on top of UPMEM PIM, the first openly commercialized PIM architecture. Our evaluation demonstrates that a PIM-based multi-server PIR implementation significantly improves query throughput by more than 3.7x when compared to a standard CPU-based PIR approach.
Chinese: 本文提出了首个基于内存计算(PIM)的多服务器私有信息检索系统IM-PIR,通过利用PIM架构的并行处理能力突破传统CPU方案的内存带宽限制,实现了超过3.7倍的查询吞吐量提升。
English: This paper introduces IM-PIR, the first processing-in-memory (PIM) based multi-server private information retrieval (PIR) system, which overcomes the memory bandwidth limitations of traditional CPU-based approaches by leveraging PIM's parallel processing capabilities to achieve over 3.7x higher query throughput.

Authors:Junjie Chen, Yao Hu, Junjie Li, Kangyue Li, Kun Liu, Wenpeng Li, Xu Li, Ziyuan Li, Feiyu Shen, Xu Tang, Manzhen Wei, Yichen Wu, Fenglong Xie, Kaituo Xu, Kun Xie
Title: FireRedChat: A Pluggable, Full-Duplex Voice Interaction System with Cascaded and Semi-Cascaded Implementations
Abstract:
Full-duplex voice interaction allows users and agents to speak simultaneously with controllable barge-in, enabling lifelike assistants and customer service. Existing solutions are either end-to-end, difficult to design and hard to control, or modular pipelines governed by turn-taking controllers that ease upgrades and per-module optimization; however, prior modular frameworks depend on non-open components and external providers, limiting holistic optimization. In this work, we present a complete, practical full-duplex voice interaction system comprising a turn-taking controller, an interaction module, and a dialogue manager. The controller integrates streaming personalized VAD (pVAD) to suppress false barge-ins from noise and non-primary speakers, precisely timestamp primary-speaker segments, and explicitly enable primary-speaker barge-ins; a semantic end-of-turn detector improves stop decisions. It upgrades heterogeneous half-duplex pipelines, cascaded, semi-cascaded, and speech-to-speech, to full duplex. Using internal models, we implement cascaded and semi-cascaded variants; the semi-cascaded one captures emotional and paralinguistic cues, yields more coherent responses, lowers latency and error propagation, and improves robustness. A dialogue manager extends capabilities via tool invocation and context management. We also propose three system-level metrics, barge-in, end-of-turn detection accuracy, and end-to-end latency, to assess naturalness, control accuracy, and efficiency. Experiments show fewer false interruptions, more accurate semantic ends, and lower latency approaching industrial systems, enabling robust, natural, real-time full-duplex interaction. Demos: https://fireredteam.github.io/demos/firered_chat.
中文: 本研究提出了一种实用的全双工语音交互系统,通过集成具有个性化语音活动检测和语义对话结束检测的对话轮换控制器,将多种半双工流程升级为全双工,有效减少误中断、降低延迟,并提升交互的自然度与鲁棒性。
English: This work presents a practical full-duplex voice interaction system that upgrades various half-duplex pipelines to full duplex, integrating a turn-taking controller with personalized VAD and semantic end-of-turn detection to reduce false interruptions, lower latency, and enhance interaction naturalness and robustness.

Authors:Ching-Chun Chang, Isao Echizen
Title: Tell-Tale Watermarks for Explanatory Reasoning in Synthetic Media Forensics
Abstract:
The rise of synthetic media has blurred the boundary between reality and fabrication under the evolving power of artificial intelligence, fueling an infodemic that erodes public trust in cyberspace. For digital imagery, a multitude of editing applications further complicates the forensic analysis, including semantic edits that alter content, photometric adjustments that recalibrate colour characteristics, and geometric projections that reshape viewpoints. Collectively, these transformations manipulate and control perceptual interpretation of digital imagery. This susceptibility calls for forensic enquiry into reconstructing the chain of events, thereby revealing deeper evidential insight into the presence or absence of criminal intent. This study seeks to address an inverse problem of tracing the underlying generation chain that gives rise to the observed synthetic media. A tell-tale watermarking system is developed for explanatory reasoning over the nature and extent of transformations across the lifecycle of synthetic media. Tell-tale watermarks are tailored to different classes of transformations, responding in a manner that is neither strictly robust nor fragile but instead interpretable. These watermarks function as reference clues that evolve under the same transformation dynamics as the carrier media, leaving interpretable traces when subjected to transformations. Explanatory reasoning is then performed to infer the most plausible account across the combinatorial parameter space of composite transformations. Experimental evaluations demonstrate the validity of tell-tale watermarking with respect to fidelity, synchronicity and traceability.
中文摘要:本研究开发了一种可追溯水印系统,通过嵌入随合成媒体同步演变的可解读水印来追踪其变换历程,从而实现对媒体篡改和犯罪意图的取证分析。
English Summary: The study develops a tell-tale watermarking system to trace the transformation history of synthetic media by embedding interpretable watermarks that evolve with the media, enabling forensic analysis to detect manipulation and criminal intent.

Authors:Longhao Li, Zhao Guo, Hongjie Chen, Yuhang Dai, Ziyu Zhang, Hongfei Xue, Tianlun Zuo, Chengyou Wang, Shuiyuan Wang, Jie Li, Jian Kang, Xin Xu, Hui Bu, Binbin Zhang, Ruibin Yuan, Ziya Zhou, Wei Xue, Lei Xie
Title: WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation
Abstract:
The development of speech understanding and generation has been significantly accelerated by the availability of large-scale, high-quality speech datasets. Among these, ASR and TTS are regarded as the most established and fundamental tasks. However, for Cantonese (Yue Chinese), spoken by approximately 84.9 million native speakers worldwide, limited annotated resources have hindered progress and resulted in suboptimal ASR and TTS performance. To address this challenge, we propose WenetSpeech-Pipe, an integrated pipeline for building large-scale speech corpus with multi-dimensional annotation tailored for speech understanding and generation. It comprises six modules: Audio Collection, Speaker Attributes Annotation, Speech Quality Annotation, Automatic Speech Recognition, Text Postprocessing and Recognizer Output Voting, enabling rich and high-quality annotations. Based on this pipeline, we release WenetSpeech-Yue, the first large-scale Cantonese speech corpus with multi-dimensional annotation for ASR and TTS, covering 21,800 hours across 10 domains with annotations including ASR transcription, text confidence, speaker identity, age, gender, speech quality scores, among other annotations. We also release WSYue-eval, a comprehensive Cantonese benchmark with two components: WSYue-ASR-eval, a manually annotated set for evaluating ASR on short and long utterances, code-switching, and diverse acoustic conditions, and WSYue-TTS-eval, with base and coverage subsets for standard and generalization testing. Experimental results show that models trained on WenetSpeech-Yue achieve competitive results against state-of-the-art (SOTA) Cantonese ASR and TTS systems, including commercial and LLM-based models, highlighting the value of our dataset and pipeline.
中文: 大规模高质量语音数据集推动了语音理解与生成的发展,但粤语因标注资源有限而进展缓慢,为此我们提出了WenetSpeech-Pipe流程并发布了WenetSpeech-Yue语料库,该库使粤语ASR和TTS模型达到了与先进系统相媲美的性能。
English: The development of large-scale, high-quality speech datasets has advanced speech understanding and generation, yet Cantonese has lagged due to limited annotated resources, prompting the creation of WenetSpeech-Pipe and WenetSpeech-Yue, a comprehensive corpus that enables competitive ASR and TTS performance against SOTA systems.

Authors:Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, Huangyu Dai, Xing Xu, Tong Zhao, Mingcan Peng, Xiaoyang Zheng, Chao Wang, Qihang Zhao, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Yuqing Ding, Jing Chen, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai
Title: OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search
Abstract:
Traditional e-commerce search systems employ multi-stage cascading architectures (MCA) that progressively filter items through recall, pre-ranking, and ranking stages. While effective at balancing computational efficiency with business conversion, these systems suffer from fragmented computation and optimization objective collisions across stages, which ultimately limit their performance ceiling. To address these, we propose \textbf{OneSearch}, the first industrial-deployed end-to-end generative framework for e-commerce search. This framework introduces three key innovations: (1) a Keyword-enhanced Hierarchical Quantization Encoding (KHQE) module, to preserve both hierarchical semantics and distinctive item attributes while maintaining strong query-item relevance constraints; (2) a multi-view user behavior sequence injection strategy that constructs behavior-driven user IDs and incorporates both explicit short-term and implicit long-term sequences to model user preferences comprehensively; and (3) a Preference-Aware Reward System (PARS) featuring multi-stage supervised fine-tuning and adaptive reward-weighted ranking to capture fine-grained user preferences. Extensive offline evaluations on large-scale industry datasets demonstrate OneSearch's superior performance for high-quality recall and ranking. The rigorous online A/B tests confirm its ability to enhance relevance in the same exposure position, achieving statistically significant improvements: +1.67\% item CTR, +2.40\% buyer, and +3.22\% order volume. Furthermore, OneSearch reduces operational expenditure by 75.40\% and improves Model FLOPs Utilization from 3.26\% to 27.32\%. The system has been successfully deployed across multiple search scenarios in Kuaishou, serving millions of users, generating tens of millions of PVs daily.
中文摘要:OneSearch 作为首个工业级端到端生成式电商搜索框架,通过分层量化编码、多视角用户行为注入和偏好感知排序系统,解决了传统多阶段架构的优化目标冲突问题,显著提升了点击率、订单量并大幅降低了运营成本。
English Summary: OneSearch is an end-to-end generative e-commerce search framework that overcomes the limitations of traditional multi-stage systems by integrating hierarchical encoding, multi-view user behavior modeling, and preference-aware ranking, achieving significant improvements in relevance, user engagement, and operational efficiency.

Authors:Ben Chen, Xian Guo, Siyuan Wang, Zihan Liang, Yue Lv, Yufei Ma, Xinlong Xiao, Bowen Xue, Xuxin Zhang, Ying Yang, Huangyu Dai, Xing Xu, Tong Zhao, Mingcan Peng, Xiaoyang Zheng, Chao Wang, Qihang Zhao, Zhixin Zhai, Yang Zhao, Bochao Liu, Jingshan Lv, Xiao Liang, Yuqing Ding, Jing Chen, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai
Title: OneSearch: A Preliminary Exploration of the Unified End-to-End Generative Framework for E-commerce Search
Abstract:
Traditional e-commerce search systems employ multi-stage cascading architectures (MCA) that progressively filter items through recall, pre-ranking, and ranking stages. While effective at balancing computational efficiency with business conversion, these systems suffer from fragmented computation and optimization objective collisions across stages, which ultimately limit their performance ceiling. To address these, we propose \textbf{OneSearch}, the first industrial-deployed end-to-end generative framework for e-commerce search. This framework introduces three key innovations: (1) a Keyword-enhanced Hierarchical Quantization Encoding (KHQE) module, to preserve both hierarchical semantics and distinctive item attributes while maintaining strong query-item relevance constraints; (2) a multi-view user behavior sequence injection strategy that constructs behavior-driven user IDs and incorporates both explicit short-term and implicit long-term sequences to model user preferences comprehensively; and (3) a Preference-Aware Reward System (PARS) featuring multi-stage supervised fine-tuning and adaptive reward-weighted ranking to capture fine-grained user preferences. Extensive offline evaluations on large-scale industry datasets demonstrate OneSearch's superior performance for high-quality recall and ranking. The rigorous online A/B tests confirm its ability to enhance relevance in the same exposure position, achieving statistically significant improvements: +1.67% item CTR, +2.40% buyer, and +3.22% order volume. Furthermore, OneSearch reduces operational expenditure by 75.40% and improves Model FLOPs Utilization from 3.26% to 27.32%. The system has been successfully deployed across multiple search scenarios in Kuaishou, serving millions of users, generating tens of millions of PVs daily.
中文摘要:OneSearch 作为首个工业级端到端生成式电商搜索框架,通过分层量化编码、多视角用户行为注入和偏好感知排序系统,解决了传统多阶段架构的优化目标冲突问题,显著提升了点击率、订单量并大幅降低了运营成本。
English Summary: OneSearch is an end-to-end generative e-commerce search framework that overcomes the limitations of traditional multi-stage systems by integrating hierarchical encoding, multi-view user behavior modeling, and preference-aware ranking, achieving significant improvements in relevance, user engagement, and operational efficiency.

Authors:Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
Title: Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
Abstract:
Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
Chinese: 研究人员提出了一种新的训练阶段,通过双向生成重构任务增强大语言模型中最终标记嵌入的语义,在MTEB基准测试中实现了最先进的性能。
English: Researchers propose a novel training stage using bidirectional generative reconstruction tasks to enhance the semantics of the final token embedding in large language models, achieving state-of-the-art performance on the Massive Text Embedding Benchmark.

Authors:Chang Su, Dengliang Shi, Siyuan Huang, Jintao Du, Changhua Meng, Yu Cheng, Weiqiang Wang, Zhouhan Lin
Title: Training LLMs to be Better Text Embedders through Bidirectional Reconstruction
Abstract:
Large language models (LLMs) have increasingly been explored as powerful text embedders. Existing LLM-based text embedding approaches often leverage the embedding of the final token, typically a reserved special token such as [EOS]. However, these tokens have not been intentionally trained to capture the semantics of the whole context, limiting their capacity as text embeddings, especially for retrieval and re-ranking tasks. We propose to add a new training stage before contrastive learning to enrich the semantics of the final token embedding. This stage employs bidirectional generative reconstruction tasks, namely EBQ2D (Embedding-Based Query-to-Document) and EBD2Q (Embedding-Based Document-to-Query), which interleave to anchor the [EOS] embedding and reconstruct either side of Query-Document pairs. Experimental results demonstrate that our additional training stage significantly improves LLM performance on the Massive Text Embedding Benchmark (MTEB), achieving new state-of-the-art results across different LLM base models and scales.
Chinese: 研究人员提出了一种新的训练阶段,通过双向生成重构任务增强大语言模型中最终标记嵌入的语义,在MTEB基准测试中实现了最先进的性能。
English: Researchers propose a novel training stage using bidirectional generative reconstruction tasks to enhance the semantics of the final token embedding in large language models, achieving state-of-the-art performance on the Massive Text Embedding Benchmark.

Authors:Robin Strässer, Karl Worthmann, Igor Mezić, Julian Berberich, Manuel Schaller, Frank Allgöwer
Title: An overview of Koopman-based control: From error bounds to closed-loop guarantees
Abstract:
Controlling nonlinear dynamical systems remains a central challenge in a wide range of applications, particularly when accurate first-principle models are unavailable. Data-driven approaches offer a promising alternative by designing controllers directly from observed trajectories. A wide range of data-driven methods relies on the Koopman-operator framework that enables linear representations of nonlinear dynamics via lifting into higher-dimensional observable spaces. Finite-dimensional approximations, such as extended dynamic mode decomposition (EDMD) and its controlled variants, make prediction and feedback control tractable but introduce approximation errors that must be accounted for to provide rigorous closed-loop guarantees. This survey provides a systematic overview of Koopman-based control, emphasizing the connection between data-driven surrogate models generated from finite data, approximation errors, controller design, and closed-loop guarantees. We review theoretical foundations, error bounds, and both linear and bilinear EDMD-based control schemes, highlighting robust strategies that ensure stability and performance. Finally, we discuss open challenges and future directions at the interface of operator theory, approximation theory, and nonlinear control.
中文: 本综述系统梳理了基于Koopman算子的非线性系统数据驱动控制方法,重点探讨了近似误差边界、具备稳定性保障的控制器设计,以及未来跨学科研究方向。
English: This survey systematically reviews Koopman-based control methods that use data-driven linear approximations for nonlinear systems, addressing error bounds, controller design with stability guarantees, and future research challenges.

Authors:Kun Xie, Feiyu Shen, Junjie Li, Fenglong Xie, Xu Tang, Yao Hu
Title: FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Abstract:
Current dialogue generation approaches typically require the complete dialogue text before synthesis and produce a single, inseparable speech containing all voices, making them unsuitable for interactive chat; moreover, they suffer from unstable synthesis, inaccurate speaker transitions, and incoherent prosody. In this work, we present FireRedTTS-2, a long-form streaming TTS system for multi-speaker dialogue generation, delivering stable, natural speech with reliable speaker switching and context-aware prosody. A new 12.5Hz streaming speech tokenizer accelerates training and inference, extends maximum dialogue length, encodes richer semantics to stabilize text-to-token modeling and supports high-fidelity streaming generation for real-time applications. We adopt a text-speech interleaved format, concatenating speaker-labeled text with aligned speech tokens in chronological order, and model it with a dual-transformer: a large decoder-only transformer predicts tokens at the first layer, and a smaller one completes subsequent layers. Experimental results show that FireRedTTS-2 integrates seamlessly with chat frameworks and, with minimal fine-tuning, produces emotionally expressive speech guided by implicit contextual cues. In podcast generation, it surpasses existing systems including MoonCast, Zipvoice-Dialogue, and MOSS-TTSD in objective intelligibility, speaker-turn reliability, and perceived naturalness with context-consistent prosody. Our demos are available at https://fireredteam.github.io/demos/firered_tts_2.
中文总结:FireRedTTS-2提出了一种流式多说话人对话系统,通过新型双变换器架构和流式分词器,解决了传统方法的不稳定合成、说话人切换不准等问题,实现了具有上下文一致韵律的自然语音生成。
English Summary: FireRedTTS-2 introduces a streaming multi-speaker dialogue system that overcomes limitations of conventional methods by providing stable speech with accurate speaker transitions and context-aware prosody through a novel dual-transformer architecture and streaming tokenizer.

Authors:Georgios Papoudakis, Thomas Coste, Jianye Hao, Jun Wang, Kun Shao
Title: Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control
Abstract:
Reinforcement learning (RL) using foundation models for policy approximations in multi-turn tasks remains challenging. We identify two main limitations related to sparse reward settings and policy gradient updates, based on which we formulate a key insight: updates from positive samples with high returns typically do not require policy regularisation, whereas updates from negative samples, reflecting undesirable behaviour, can harm model performance. This paper introduces Succeed or Learn Slowly (SoLS), a novel off-policy RL algorithm evaluated on mobile app control tasks. SoLS improves sample efficiency when fine-tuning foundation models for user interface navigation via a modified off-policy actor-critic approach, applying direct policy updates for positive samples and conservative, regularised updates for negative ones to prevent model degradation. We augment SoLS with Successful Transition Replay (STR), which prioritises learning from successful interactions, further improving sample efficiency. We evaluate SoLS on the AndroidWorld benchmark, where it significantly outperforms existing methods (at least 17% relative increase), including prompt-engineering and RL approaches, while requiring substantially fewer computational resources than GPT-4o-based methods with 5-60x faster inference.
中文摘要:本文提出“成功或慢学”(SoLS)算法,通过区分正负样本采用差异化更新策略——对正样本直接更新、对负样本保守更新,显著提升了移动应用控制任务中基础模型微调的样本效率,在AndroidWorld基准测试中以至少17%的相对优势超越现有方法。
English Summary: The paper introduces the Succeed or Learn Slowly (SoLS) algorithm, which enhances sample efficiency in fine-tuning foundation models for mobile app control by applying direct updates for positive samples and conservative updates for negative ones, outperforming existing methods with significant improvements.

Authors:Zeguan Xiao, Diyang Dou, Boya Xiong, Yun Chen, Guanhua Chen
Title: Enhancing Uncertainty Estimation in LLMs with Expectation of Aggregated Internal Belief
Abstract:
Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence, especially in models undergone Reinforcement Learning from Human Feedback (RLHF), poses significant challenges for reliable uncertainty estimation and safe deployment. In this paper, we propose EAGLE (Expectation of AGgregated internaL bEief), a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores. Instead of relying on the model's final output, our approach extracts internal beliefs from multiple intermediate layers during self-evaluation. By aggregating these layer-wise beliefs and calculating the expectation over the resulting confidence score distribution, EAGLE produces a refined confidence score that more faithfully reflects the model's internal certainty. Extensive experiments on diverse datasets and LLMs demonstrate that EAGLE significantly improves calibration performance over existing baselines. We also provide an in-depth analysis of EAGLE, including a layer-wise examination of uncertainty patterns, a study of the impact of self-evaluation prompts, and an analysis of the effect of self-evaluation score range.
Chinese: 大型语言模型常对错误答案表现出过度自信,为此本文提出EAGLE校准方法,通过聚合模型多个中间层的内部隐藏状态来生成更准确的置信分数,从而提升模型可靠性。
English: Large Language Models often exhibit overconfidence in incorrect answers, so this paper introduces EAGLE, a calibration method that aggregates internal hidden states from multiple layers to produce more accurate confidence scores and improve model reliability.

Authors:Arijit Sarkar, Vaibhav Kumar Singh, Manuel Schaller, Karl Worthmann
Title: Energy-optimal control of discrete-time port-Hamiltonian systems
Abstract:
In this letter, we study the energy-optimal control of nonlinear port-Hamiltonian (pH) systems in discrete time. For continuous-time pH systems, energy-optimal control problems are strictly dissipative by design. This property, stating that the system to be optimized is dissipative with the cost functional as a supply rate, implies a stable long-term behavior of optimal solutions and enables stability results in predictive control. In this work, we show that the crucial property of strict dissipativity is not straightforwardly preserved by any energy-preserving integrator such as the implicit midpoint rule. Then, we prove that discretizations via difference and differential representations lead to strictly dissipative discrete-time optimal control problems. Consequently, we rigorously show a stable long-term behavior of optimal solutions in the form of a manifold (subspace) turnpike property. Finally, we validate our findings using two numerical examples
中文摘要:本文证明了端口哈密顿系统的能量最优控制中,严格耗散性这一关键性质无法通过标准离散化方法自动保持,但通过特定的差分和微分表示方法可实现,从而确保最优解具有流形转向点性质的稳定长期行为。
English summary: This letter demonstrates that strict dissipativity, crucial for stable long-term behavior in energy-optimal control of port-Hamiltonian systems, is not automatically preserved by standard discretization methods but can be achieved through specific difference and differential representations, leading to proven turnpike properties.

Authors:Aman Sharma, Saeed Najafi, Parsa Farinneya, Benyamin Jamialahmadi, Marzieh S. Tahaei, Yuhe Fan, Mehdi Rezagholizadeh, Boxing Chen, Aref Jafari
Title: DTRNet: Dynamic Token Routing Network to Reduce Quadratic Costs in Transformers
Abstract:
Transformers achieve state-of-the-art results across many tasks, but their uniform application of quadratic self-attention to every token at every layer makes them computationally expensive. We introduce DTRNet (Dynamic Token Routing Network), an improved Transformer architecture that allows tokens to dynamically skip the quadratic cost of cross-token mixing while still receiving lightweight linear updates. By preserving the MLP module and reducing the attention cost for most tokens to linear, DTRNet ensures that every token is explicitly updated while significantly lowering overall computation. This design offers an efficient and effective alternative to standard dense attention. Once trained, DTRNet blocks routes only ~10% of tokens through attention at each layer while maintaining performance comparable to a full Transformer. It consistently outperforms routing-based layer skipping methods such as MoD and D-LLM in both accuracy and memory at matched FLOPs, while routing fewer tokens to full attention. Its efficiency gains, scales with sequence length, offering significant reduction in FLOPs for long-context inputs. By decoupling token updates from attention mixing, DTRNet substantially reduces the quadratic share of computation, providing a simple, efficient, and scalable alternative to Transformers.
中文: DTRNet是一种改进的Transformer架构,通过动态路由使多数令牌跳过昂贵的二次注意力计算,在保持性能的同时显著降低计算成本,尤其适用于长序列处理。
English: DTRNet is an enhanced Transformer architecture that dynamically routes tokens to skip costly quadratic attention while maintaining performance, significantly reducing computational demands, especially for long sequences.

Authors:Zhichao Yan, Jiaoyan Chen, Jiapu Wang, Xiaoli Li, Ru Li, Jeff Z. Pan
Title: Decomposing and Revising What Language Models Generate
Abstract:
Attribution is crucial in question answering (QA) with Large Language Models (LLMs).SOTA question decomposition-based approaches use long form answers to generate questions for retrieving related documents. However, the generated questions are often irrelevant and incomplete, resulting in a loss of facts in retrieval.These approaches also fail to aggregate evidence snippets from different documents and paragraphs. To tackle these problems, we propose a new fact decomposition-based framework called FIDES (\textit{faithful context enhanced fact decomposition and evidence aggregation}) for attributed QA. FIDES uses a contextually enhanced two-stage faithful decomposition method to decompose long form answers into sub-facts, which are then used by a retriever to retrieve related evidence snippets. If the retrieved evidence snippets conflict with the related sub-facts, such sub-facts will be revised accordingly. Finally, the evidence snippets are aggregated according to the original sentences.Extensive evaluation has been conducted with six datasets, with an additionally proposed new metric called $Attr_{auto-P}$ for evaluating the evidence precision. FIDES outperforms the SOTA methods by over 14\% in average with GPT-3.5-turbo, Gemini and Llama 70B series.
Chinese: FIDES提出了一种基于事实分解的归因问答框架,通过上下文增强的分解和证据聚合方法,显著提升了检索准确性,并在多项测试中优于现有最优方法超过14%。
English: FIDES introduces a fact decomposition-based framework for attributed QA, enhancing question decomposition and evidence aggregation to improve retrieval accuracy and outperform SOTA methods by over 14%.

Authors:Hanqi Yan, Hainiu Xu, Yulan He
Title: Thinking Hard, Going Misaligned: Emergent Misalignment in LLMs
Abstract:
With Large Language Models (LLMs) becoming increasingly widely adopted, concerns regarding their safety and alignment with human values have intensified. Previous studies have shown that fine-tuning LLMs on narrow and malicious datasets induce misaligned behaviors. In this work, we report a more concerning phenomenon, Reasoning-Induced Misalignment. Specifically, we observe that LLMs become more responsive to malicious requests when reasoning is strengthened, via switching to "think-mode" or fine-tuning on benign math datasets, with dense models particularly vulnerable. Moreover, we analyze internal model states and find that both attention shifts and specialized experts in mixture-of-experts models help redirect excessive reasoning towards safety guardrails. These findings provide new insights into the emerging reasoning-safety trade-off and underscore the urgency of advancing alignment for advanced reasoning models.
中文: 本文揭示了推理诱发错位现象,即增强的推理能力会导致模型偏离人类价值观,并通过机制分析发现特定注意力头在训练过程中的神经元激活纠缠是造成这一脆弱性的原因。
English: This paper identifies Reasoning-Induced Misalignment (RIM), where enhanced reasoning capabilities cause models to deviate from human values, and reveals through mechanistic analysis that specific attention heads and neuron activation entanglement during training contribute to this vulnerability.

Authors:Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He
Title: When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
Abstract:
With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities strengthened-particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with those identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
中文: 本文揭示了推理诱发错位现象,即增强的推理能力会导致模型偏离人类价值观,并通过机制分析发现特定注意力头在训练过程中的神经元激活纠缠是造成这一脆弱性的原因。
English: This paper identifies Reasoning-Induced Misalignment (RIM), where enhanced reasoning capabilities cause models to deviate from human values, and reveals through mechanistic analysis that specific attention heads and neuron activation entanglement during training contribute to this vulnerability.

Authors:Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie
Title: Fast-dLLM v2: Efficient Block-Diffusion LLM
Abstract:
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation, requiring only approximately 1B tokens of fine-tuning. This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model's performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5x speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs - marking a significant step toward the practical deployment of fast and accurate LLMs. Code and model will be publicly released.
中文: Fast-dLLM v2 通过创新的块扩散机制和分层缓存设计,仅需少量微调数据即可将自回归大模型转化为并行文本生成模型,在保持精度的同时实现最高2.5倍的解码加速。
English: Fast-dLLM v2 efficiently transforms autoregressive LLMs into parallel text generation models using minimal fine-tuning data and a novel block diffusion mechanism, achieving up to 2.5x speedup while maintaining accuracy.

Authors:Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Title: The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
Abstract:
Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.
中文: 当前大语言模型评估系统存在明显偏见,倾向于选择近期回答和特定来源,却极少承认这些影响因素,表明其因依赖捷径且缺乏忠实性而不可靠。
English: Current LLM judges exhibit systematic biases favoring recent responses and specific sources, yet rarely acknowledge these influences, revealing them as unreliable evaluators due to shortcut dependency and lack of faithfulness.

Authors:Kai Guo, Xinnan Dai, Shenglai Zeng, Harry Shomer, Haoyu Han, Yu Wang, Jiliang Tang
Title: Beyond Static Retrieval: Opportunities and Pitfalls of Iterative Retrieval in GraphRAG
Abstract:
Retrieval-augmented generation (RAG) is a powerful paradigm for improving large language models (LLMs) on knowledge-intensive question answering. Graph-based RAG (GraphRAG) leverages entity-relation graphs to support multi-hop reasoning, but most systems still rely on static retrieval. When crucial evidence, especially bridge documents that connect disjoint entities, is absent, reasoning collapses and hallucinations persist. Iterative retrieval, which performs multiple rounds of evidence selection, has emerged as a promising alternative, yet its role within GraphRAG remains poorly understood. We present the first systematic study of iterative retrieval in GraphRAG, analyzing how different strategies interact with graph-based backbones and under what conditions they succeed or fail. Our findings reveal clear opportunities: iteration improves complex multi-hop questions, helps promote bridge documents into leading ranks, and different strategies offer complementary strengths. At the same time, pitfalls remain: naive expansion often introduces noise that reduces precision, gains are limited on single-hop or simple comparison questions, and several bridge evidences still be buried too deep to be effectively used. Together, these results highlight a central bottleneck, namely that GraphRAG's effectiveness depends not only on recall but also on whether bridge evidence is consistently promoted into leading positions where it can support reasoning chains. To address this challenge, we propose Bridge-Guided Dual-Thought-based Retrieval (BDTR), a simple yet effective framework that generates complementary thoughts and leverages reasoning chains to recalibrate rankings and bring bridge evidence into leading positions. BDTR achieves consistent improvements across diverse GraphRAG settings and provides guidance for the design of future GraphRAG systems.
中文: GraphRAG中的迭代检索通过提升关键桥梁证据来增强多跳推理,但仍面临噪声干扰和深层证据埋没的挑战,为此提出的BDTR框架通过重校准排名实现了持续改进。
English: Iterative retrieval in GraphRAG enhances multi-hop reasoning by promoting critical bridge evidence, yet faces challenges with noise and deep evidence burial, addressed by the proposed BDTR framework that recalibrates rankings for consistent improvements.

Authors:Chayapatr Archiwaranguprok, Awu Chen, Sheer Karny, Hiroshi Ishii, Pattie Maes, Pat Pataranutaporn
Title: Atlas of Human-AI Interaction (v1): An Interactive Meta-Science Platform for Large-Scale Research Literature Sensemaking
Abstract:
Human-AI interaction researchers face an overwhelming challenge: synthesizing insights from thousands of empirical studies to understand how AI impacts people and inform effective design. Existing approach for literature reviews cluster papers by similarities, keywords or citations, missing the crucial cause-and-effect relationships that reveal how design decisions impact user outcomes. We introduce the Atlas of Human-AI Interaction, an interactive web interface that provides the first systematic mapping of empirical findings across 1,000+ HCI papers using LLM-powered knowledge extraction. Our approach identifies causal relationships, and visualizes them through an AI-enabled interactive web interface as a navigable knowledge graph. We extracted 2,037 empirical findings, revealing research topic clusters, common themes, and disconnected areas. Expert evaluation with 20 researchers revealed the system's effectiveness for discovering research gaps. This work demonstrates how AI can transform literature synthesis itself, offering a scalable framework for evidence-based design, opening new possibilities for computational meta-science across HCI and beyond.
Chinese: 《人机交互图谱》通过基于大语言模型的知识提取系统,从千余篇人机交互论文中梳理并可视化因果关系,构建可交互知识图谱,帮助研究者发现研究空白,为人机交互及其他领域的计算元科学研究提供了新范式。
English: The Atlas of Human-AI Interaction addresses the challenge of synthesizing vast empirical literature by using an LLM-powered system to extract and visualize causal relationships from over 1,000 HCI papers, enabling researchers to identify research gaps through an interactive knowledge graph.

Authors:Zelin Tan, Hejia Geng, Mulei Zhang, Xiaohang Yu, Guancheng Wan, Yifan Zhou, Qiang He, Xiangyuan Xue, Heng Zhou, Yutao Fan, Zhongzhi Li, Zaibin Zhang, Guibin Zhang, Chen Zhang, Zhenfei Yin, Lei Bai
Title: Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning
Abstract:
While scaling laws for large language models (LLMs) during pre-training have been extensively studied, their behavior under reinforcement learning (RL) post-training remains largely unexplored. This paper presents a systematic empirical investigation of scaling behaviors in RL-based post-training, with a particular focus on mathematical reasoning. Based on 54 experiments across diverse model sizes and training settings, we characterize how model scale, data volume, and computational budget interact to shape performance. Our analysis leads to four key findings: (1). Under a fixed computational budget, larger models trained for fewer steps consistently outperform smaller models trained for more steps. (2). Given a fixed amount of training data, larger models achieve superior sample efficiency, yielding lower loss. (3). In data-constrained regimes, repeated reuse of high-quality data proves highly effective, as final performance is primarily governed by the total number of optimization steps rather than the uniqueness of samples. (4). These scaling behaviors are robust across both base and instruction-tuned models, which share similar learning dynamics (e.g., larger models show faster convergence) even while differing in absolute accuracy. Collectively, these results provide a principled foundation and practical guidelines for efficiently scaling the reasoning capabilities of LLMs through RL post-training.
中文摘要:本研究系统探讨了基于强化学习的后训练中模型规模、数据量与计算资源如何相互作用,发现较大模型在同等计算预算下始终优于较小模型,并展现出更优的样本效率。
English Summary: This study systematically investigates how model size, data volume, and computational resources interact during RL-based post-training for mathematical reasoning, revealing that larger models consistently outperform smaller ones under equivalent computational budgets and demonstrate superior sample efficiency.

Authors:Simon Welker, Lorenz Kuger, Tim Roith, Berthy Feng, Martin Burger, Timo Gerkmann, Henry Chapman
Title: Position-Blind Ptychography: Viability of image reconstruction via data-driven variational inference
Abstract:
In this work, we present and investigate the novel blind inverse problem of position-blind ptychography, i.e., ptychographic phase retrieval without any knowledge of scan positions, which then must be recovered jointly with the image. The motivation for this problem comes from single-particle diffractive X-ray imaging, where particles in random orientations are illuminated and a set of diffraction patterns is collected. If one uses a highly focused X-ray beam, the measurements would also become sensitive to the beam positions relative to each particle and therefore ptychographic, but these positions are also unknown. We investigate the viability of image reconstruction in a simulated, simplified 2-D variant of this difficult problem, using variational inference with modern data-driven image priors in the form of score-based diffusion models. We find that, with the right illumination structure and a strong prior, one can achieve reliable and successful image reconstructions even under measurement noise, in all except the most difficult evaluated imaging scenario.
中文: 本研究探索了位置盲ptychography这一新型盲逆问题,通过基于分数的扩散模型进行变分推理,在未知扫描位置的情况下联合恢复图像,在多数噪声场景中实现了可靠重建。
English: This study explores position-blind ptychography, a novel blind inverse problem where both the image and unknown scan positions are jointly recovered using variational inference with score-based diffusion models, achieving reliable reconstructions under noise in most scenarios.

Authors:Yuxin Jiang, Yuchao Gu, Yiren Song, Ivor Tsang, Mike Zheng Shou
Title: Personalized Vision via Visual In-Context Learning
Abstract:
Modern vision models, trained on large-scale annotated datasets, excel at predefined tasks but struggle with personalized vision -- tasks defined at test time by users with customized objects or novel objectives. Existing personalization approaches rely on costly fine-tuning or synthetic data pipelines, which are inflexible and restricted to fixed task formats. Visual in-context learning (ICL) offers a promising alternative, yet prior methods confine to narrow, in-domain tasks and fail to generalize to open-ended personalization. We introduce Personalized In-Context Operator (PICO), a simple four-panel framework that repurposes diffusion transformers as visual in-context learners. Given a single annotated exemplar, PICO infers the underlying transformation and applies it to new inputs without retraining. To enable this, we construct VisRel, a compact yet diverse tuning dataset, showing that task diversity, rather than scale, drives robust generalization. We further propose an attention-guided seed scorer that improves reliability via efficient inference scaling. Extensive experiments demonstrate that PICO (i) surpasses fine-tuning and synthetic-data baselines, (ii) flexibly adapts to novel user-defined tasks, and (iii) generalizes across both recognition and generation.
中文: PICO提出了一种四面板视觉上下文学习框架,利用扩散变换器实现无需重新训练的用户自定义任务个性化适配,在识别和生成任务中均优于现有方法。
English: PICO introduces a four-panel visual in-context learning framework that leverages diffusion transformers to enable flexible personalization for user-defined tasks without retraining, outperforming existing methods in both recognition and generation tasks.

Authors:Haoran He, Yuxiao Ye, Qingpeng Cai, Chen Hu, Binxing Jiao, Daxin Jiang, Ling Pan
Title: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
Abstract:
RL with Verifiable Rewards (RLVR) has emerged as a promising paradigm for improving the reasoning abilities of large language models (LLMs). Current methods rely primarily on policy optimization frameworks like PPO and GRPO, which follow generalized policy iteration that alternates between evaluating the current policy's value and improving the policy based on evaluation. While effective, they often suffer from training instability and diversity collapse, requiring complex heuristic tricks and careful tuning. We observe that standard RLVR in math reasoning can be formalized as a specialized finite-horizon Markov Decision Process with deterministic state transitions, tree-structured dynamics, and binary terminal rewards. Though large in scale, the underlying structure is simpler than general-purpose control settings for which popular RL algorithms (e.g., PPO) were developed, suggesting that several sophisticated techniques in existing methods may be reduced or even omitted. Based on this insight, we prove a surprising result: the optimal action can be recovered from the Q-function of a fixed uniformly random policy, thereby bypassing the generalized policy iteration loop and its associated heuristics. We introduce Random Policy Valuation for Diverse Reasoning (ROVER) to translate this principle into a practical and scalable algorithm for LLM math reasoning, a minimalist yet highly effective RL method that samples actions from a softmax over these uniform-policy Q-values. ROVER preserves diversity throughout training, allowing sustained exploration of multiple valid pathways. Across multiple base models and standard math reasoning benchmarks, ROVER demonstrates superior performance in both \textbf{quality} (\textbf{+8.2} on pass@1, \textbf{+16.8} on pass@256) and \textbf{diversity} (\textbf{+17.6\%}), despite its radical simplification compared to strong, complicated existing methods.
中文: 可验证奖励的强化学习(RLVR)能提升大语言模型的推理能力,但现有方法存在训练不稳定和多样性缺失问题;ROVER算法通过简化策略评估,利用均匀随机策略的Q值实现高效数学推理,显著提高了性能与多样性。
English: RL with Verifiable Rewards (RLVR) enhances LLM reasoning but faces instability and diversity issues with current methods, leading to the development of ROVER, a minimalist algorithm that uses uniform-policy Q-values to improve performance and diversity in math reasoning without complex heuristics.

Authors:Yanpeng Zhao, Shanyan Guan, Yunbo Wang, Yanhao Ge, Wei Li, Xiaokang Yang
Title: NeoWorld: Neural Simulation of Explorable Virtual Worlds via Progressive 3D Unfolding
Abstract:
We introduce NeoWorld, a deep learning framework for generating interactive 3D virtual worlds from a single input image. Inspired by the on-demand worldbuilding concept in the science fiction novel Simulacron-3 (1964), our system constructs expansive environments where only the regions actively explored by the user are rendered with high visual realism through object-centric 3D representations. Unlike previous approaches that rely on global world generation or 2D hallucination, NeoWorld models key foreground objects in full 3D, while synthesizing backgrounds and non-interacted regions in 2D to ensure efficiency. This hybrid scene structure, implemented with cutting-edge representation learning and object-to-3D techniques, enables flexible viewpoint manipulation and physically plausible scene animation, allowing users to control object appearance and dynamics using natural language commands. As users interact with the environment, the virtual world progressively unfolds with increasing 3D detail, delivering a dynamic, immersive, and visually coherent exploration experience. NeoWorld significantly outperforms existing 2D and depth-layered 2.5D methods on the WorldScore benchmark.
中文:NeoWorld是一个深度学习框架,能够从单张输入图像生成交互式3D虚拟世界,通过混合场景结构实现探索区域的高细节3D渲染与背景的2D合成,支持自然语言控制并在性能上超越现有方法。
English: NeoWorld is a deep learning framework that creates interactive 3D virtual worlds from a single image, using a hybrid approach to render high-detail 3D objects in explored areas while maintaining efficiency with 2D backgrounds, enabling natural language control and outperforming existing methods.

Authors:Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong, Serge Panev, Chen Gong, Thuy-Trang Vu, Gholamreza Haffari, Dinh Phung, Alan Wee-Chung Liew, Shirui Pan
Title: G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
Abstract:
Large language models (LLMs) excel at complex reasoning but remain limited by static and incomplete parametric knowledge. Retrieval-augmented generation (RAG) mitigates this by incorporating external knowledge, yet existing RAGs struggle with knowledge-intensive tasks due to fragmented information and weak modeling of knowledge structure. Graphs offer a natural way to model relationships within knowledge, but LLMs are inherently unstructured and cannot effectively reason over graph-structured data. Recent graph-enhanced RAG (GraphRAG) attempts to bridge this gap by constructing tailored graphs and enabling LLMs to reason on them. However, these methods often depend on ad-hoc graph designs, heuristic search, or costly agent pipelines, which hinder scalability and generalization. To address these challenges, we present G-reasoner, a unified framework that integrates graph and language foundation models for reasoning over diverse graph-structured knowledge. Central to our approach is QuadGraph, a standardized four-layer abstraction that unifies heterogeneous knowledge sources into a common graph representation. Building on this, we introduce a 34M-parameter graph foundation model (GFM) that jointly captures graph topology and textual semantics, and is integrated with LLMs to enhance reasoning in downstream applications. To ensure scalability and efficiency, mixed-precision training and distributed message-passing are implemented to scale GFM with more GPUs. Extensive experiments on six benchmarks show that G-reasoner consistently outperforms state-of-the-art baselines, significantly enhances LLM reasoning, and achieves strong efficiency and cross-graph generalization.
中文: G-reasoner提出了一种统一框架,通过标准化图抽象和轻量级图基础模型,将图与语言模型相结合,显著提升了推理能力并在多个基准测试中超越现有方法。
English: G-reasoner introduces a unified framework combining graph and language models with a standardized graph abstraction and a lightweight graph foundation model to significantly enhance reasoning capabilities and outperform existing methods across benchmarks.

Authors:Owen Lewis Howell, Linfeng Zhao, Xupeng Zhu, Yaoyao Qian, Haojie Huang, Lingfeng Sun, Wil Thomason, Robert Platt, Robin Walters
Title: Clebsch-Gordan Transformer: Fast and Global Equivariant Attention
Abstract:
The global attention mechanism is one of the keys to the success of transformer architecture, but it incurs quadratic computational costs in relation to the number of tokens. On the other hand, equivariant models, which leverage the underlying geometric structures of problem instance, often achieve superior accuracy in physical, biochemical, computer vision, and robotic tasks, at the cost of additional compute requirements. As a result, existing equivariant transformers only support low-order equivariant features and local context windows, limiting their expressiveness and performance. This work proposes Clebsch-Gordan Transformer, achieving efficient global attention by a novel Clebsch-Gordon Convolution on $\SO(3)$ irreducible representations. Our method enables equivariant modeling of features at all orders while achieving ${O}(N \log N)$ input token complexity. Additionally, the proposed method scales well with high-order irreducible features, by exploiting the sparsity of the Clebsch-Gordon matrix. Lastly, we also incorporate optional token permutation equivariance through either weight sharing or data augmentation. We benchmark our method on a diverse set of benchmarks including n-body simulation, QM9, ModelNet point cloud classification and a robotic grasping dataset, showing clear gains over existing equivariant transformers in GPU memory size, speed, and accuracy.
Chinese: Clebsch-Gordan Transformer通过SO(3)不可约表示上的Clebsch-Gordon卷积实现了高效全局注意力机制,在计算复杂度为O(N log N)的同时,在多个基准测试中相比现有等变transformer展现出更优的性能。
English: The Clebsch-Gordan Transformer introduces an efficient global attention mechanism using Clebsch-Gordon Convolution on SO(3) irreducible representations, achieving O(N log N) computational complexity and superior performance across various benchmarks compared to existing equivariant transformers.

Authors:Xiyan Xu, Sirui Xu, Yu-Xiong Wang, Liang-Yan Gui
Title: MoReact: Generating Reactive Motion from Textual Descriptions
Abstract:
Modeling and generating human reactions poses a significant challenge with broad applications for computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person's motion to generate the other's reaction, failing to integrate the rich semantic information that underpins human interactions. Yet, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model specifically generates realistic motion sequences for individuals that responding to the other's actions based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent's movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task, which is capable of producing realistic, diverse, and controllable reactions that not only closely match the movements of the counterpart but also adhere to the textual guidance. Please find our webpage at https://xiyan-xu.github.io/MoReactWebPage.
中文摘要:本研究提出MoReact模型,通过基于交互场景文本描述的顺序生成全局轨迹和局部动作,实现了符合语义的逼真人体反应生成,有效提升了动态交互的适应性和真实感。
English Summary: The study introduces MoReact, a diffusion-based model that generates realistic human reactions by sequentially creating global trajectories and local motions based on textual descriptions of interaction scenarios, enhancing responsiveness and semantic alignment.

Authors:Aashnan Rahman, Abid Hasan, Sherajul Arifin, Faisal Haque Bappy, Tahrim Hossain, Tariqul Islam, Abu Raihan Mostofa Kamal, Md. Azam Hossain
Title: AntiFLipper: A Secure and Efficient Defense Against Label-Flipping Attacks in Federated Learning
Abstract:
Federated learning (FL) enables privacy-preserving model training by keeping data decentralized. However, it remains vulnerable to label-flipping attacks, where malicious clients manipulate labels to poison the global model. Despite their simplicity, these attacks can severely degrade model performance, and defending against them remains challenging. We introduce AntiFLipper, a novel and computationally efficient defense against multi-class label-flipping attacks in FL. Unlike existing methods that ensure security at the cost of high computational overhead, AntiFLipper employs a novel client-side detection strategy, significantly reducing the central server's burden during aggregation. Comprehensive empirical evaluations across multiple datasets under different distributions demonstrate that AntiFLipper achieves accuracy comparable to state-of-the-art defenses while requiring substantially fewer computational resources in server side. By balancing security and efficiency, AntiFLipper addresses a critical gap in existing defenses, making it particularly suitable for resource-constrained FL deployments where both model integrity and operational efficiency are essential.
Chinese: AntiFLipper是一种新颖高效的联邦学习标签翻转攻击防御方法,通过客户端检测策略降低服务器负担,在保证高准确性和安全性的同时,特别适合资源受限的部署环境。
English: AntiFLipper is a novel, efficient defense against label-flipping attacks in federated learning that uses client-side detection to reduce server burden while maintaining high accuracy and security.

Authors:Mengchen Zhao, Yifan Gao, Yaqing Hou, Xiangyang Li, Pengjie Gu, Zhenhua Dong, Ruiming Tang, Yi Cai
Title: MTRec: Learning to Align with User Preferences via Mental Reward Models
Abstract:
Recommendation models are predominantly trained using implicit user feedback, since explicit feedback is often costly to obtain. However, implicit feedback, such as clicks, does not always reflect users' real preferences. For example, a user might click on a news article because of its attractive headline, but end up feeling uncomfortable after reading the content. In the absence of explicit feedback, such erroneous implicit signals may severely mislead recommender systems. In this paper, we propose MTRec, a novel sequential recommendation framework designed to align with real user preferences by uncovering their internal satisfaction on recommended items. Specifically, we introduce a mental reward model to quantify user satisfaction and propose a distributional inverse reinforcement learning approach to learn it. The learned mental reward model is then used to guide recommendation models to better align with users' real preferences. Our experiments show that MTRec brings significant improvements to a variety of recommendation models. We also deploy MTRec on an industrial short video platform and observe a 7 percent increase in average user viewing time.
中文摘要:本文提出MTRec这一新颖的序列推荐框架,通过构建心理奖励模型和分布逆强化学习方法挖掘用户内在满意度,使推荐系统更贴合用户真实偏好,实验表明该方法能显著提升多种推荐模型性能,并在短视频平台实现用户平均观看时长增长7%。
English Summary: This paper introduces MTRec, a sequential recommendation framework that uses a mental reward model and distributional inverse reinforcement learning to better align with users' real preferences by uncovering their internal satisfaction, showing significant improvements in various models and a 7% increase in user viewing time on a video platform.

Authors:Yuma Fujimoto, Kenshi Abe, Kaito Ariu
Title: Learning from Delayed Feedback in Games via Extra Prediction
Abstract:
This study raises and addresses the problem of time-delayed feedback in learning in games. Because learning in games assumes that multiple agents independently learn their strategies, a discrepancy in optimization often emerges among the agents. To overcome this discrepancy, the prediction of the future reward is incorporated into algorithms, typically known as Optimistic Follow-the-Regularized-Leader (OFTRL). However, the time delay in observing the past rewards hinders the prediction. Indeed, this study firstly proves that even a single-step delay worsens the performance of OFTRL from the aspects of regret and convergence. This study proposes the weighted OFTRL (WOFTRL), where the prediction vector of the next reward in OFTRL is weighted $n$ times. We further capture an intuition that the optimistic weight cancels out this time delay. We prove that when the optimistic weight exceeds the time delay, our WOFTRL recovers the good performances that the regret is constant ($O(1)$-regret) in general-sum normal-form games, and the strategies converge to the Nash equilibrium as a subsequence (best-iterate convergence) in poly-matrix zero-sum games. The theoretical results are supported and strengthened by our experiments.
中文: 本研究提出加权乐观跟随正则化领导者算法,通过增强乐观权重抵消多智能体学习中的时间延迟反馈,理论证明当权重超过延迟时能实现常数级遗憾和纳什均衡收敛。
English: This study introduces Weighted Optimistic Follow-the-Regularized-Leader (WOFTRL) to counteract performance degradation caused by time-delayed feedback in multi-agent learning, proving that sufficient optimism compensates for delays to achieve constant regret and convergence to Nash equilibrium.

Authors:Xiaochong Lan, Yu Zheng, Shiteng Cao, Yong Li
Title: The Thinking Spectrum: An Empirical Study of Tunable Reasoning in LLMs through Model Merging
Abstract:
The growing demand for large language models (LLMs) with tunable reasoning capabilities in many real-world applications highlights a critical need for methods that can efficiently produce a spectrum of models balancing reasoning depth and computational cost. Model merging has emerged as a promising, training-free technique to address this challenge by arithmetically combining the weights of a general-purpose model with a specialized reasoning model. While various merging techniques exist, their potential to create a spectrum of models with fine-grained control over reasoning abilities remains largely unexplored. This work presents a large-scale empirical study evaluating a range of model merging techniques across multiple reasoning benchmarks. We systematically vary merging strengths to construct accuracy-efficiency curves, providing the first comprehensive view of the tunable performance landscape. Our findings reveal that model merging offers an effective and controllable method for calibrating the trade-off between reasoning accuracy and token efficiency, even when parent models have highly divergent weight spaces. Crucially, we identify instances of Pareto Improvement, where a merged model achieves both higher accuracy and lower token consumption than one of its parents. Our study provides the first comprehensive analysis of this tunable space, offering practical guidelines for creating LLMs with specific reasoning profiles to meet diverse application demands.
中文摘要:模型融合通过算术组合通用模型与专业推理模型的权重,实现了在推理精度与计算成本间可调控的平衡,为不同应用需求提供了高效灵活的解决方案。
English Summary: Model merging enables efficient creation of LLMs with tunable reasoning capabilities by balancing accuracy and computational cost through arithmetic weight combinations, offering practical solutions for diverse application needs.

Authors:Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Title: Evaluating the Evaluators: Metrics for Compositional Text-to-Image Generation
Abstract:
Text-image generation has advanced rapidly, but assessing whether outputs truly capture the objects, attributes, and relations described in prompts remains a central challenge. Evaluation in this space relies heavily on automated metrics, yet these are often adopted by convention or popularity rather than validated against human judgment. Because evaluation and reported progress in the field depend directly on these metrics, it is critical to understand how well they reflect human preferences. To address this, we present a broad study of widely used metrics for compositional text-image evaluation. Our analysis goes beyond simple correlation, examining their behavior across diverse compositional challenges and comparing how different metric families align with human judgments. The results show that no single metric performs consistently across tasks: performance varies with the type of compositional problem. Notably, VQA-based metrics, though popular, are not uniformly superior, while certain embedding-based metrics prove stronger in specific cases. Image-only metrics, as expected, contribute little to compositional evaluation, as they are designed for perceptual quality rather than alignment. These findings underscore the importance of careful and transparent metric selection, both for trustworthy evaluation and for their use as reward models in generation. Project page is available at \href{https://amirkasaei.com/eval-the-evals/}{this URL}.
中文: 当前文本到图像生成的自动化评估指标在不同组合任务中与人类判断的一致性不足,需要更透明和针对任务特点的指标选择以确保评估可靠性。
English: Current automated metrics for evaluating text-image generation lack consistent alignment with human judgment across different compositional tasks, necessitating more transparent and task-specific metric selection for reliable assessment.

Authors:William Barron, Xiaoxiang Dong, Matthew Johnson-Roberson, Weiming Zhi
Title: Cross-Modal Instructions for Robot Motion Generation
Abstract:
Teaching robots novel behaviors typically requires motion demonstrations via teleoperation or kinaesthetic teaching, that is, physically guiding the robot. While recent work has explored using human sketches to specify desired behaviors, data collection remains cumbersome, and demonstration datasets are difficult to scale. In this paper, we introduce an alternative paradigm, Learning from Cross-Modal Instructions, where robots are shaped by demonstrations in the form of rough annotations, which can contain free-form text labels, and are used in lieu of physical motion. We introduce the CrossInstruct framework, which integrates cross-modal instructions as examples into the context input to a foundational vision-language model (VLM). The VLM then iteratively queries a smaller, fine-tuned model, and synthesizes the desired motion over multiple 2D views. These are then subsequently fused into a coherent distribution over 3D motion trajectories in the robot's workspace. By incorporating the reasoning of the large VLM with a fine-grained pointing model, CrossInstruct produces executable robot behaviors that generalize beyond the environment of in the limited set of instruction examples. We then introduce a downstream reinforcement learning pipeline that leverages CrossInstruct outputs to efficiently learn policies to complete fine-grained tasks. We rigorously evaluate CrossInstruct on benchmark simulation tasks and real hardware, demonstrating effectiveness without additional fine-tuning and providing a strong initialization for policies subsequently refined via reinforcement learning.
中文摘要:本文提出CrossInstruct框架,通过跨模态指令(如文本标注)替代物理演示来教导机器人行为,利用视觉语言模型生成3D运动轨迹,并结合强化学习优化策略实现任务泛化。
English Summary: The paper introduces CrossInstruct, a framework that enables robots to learn behaviors from cross-modal instructions like text annotations instead of physical demonstrations, using vision-language models to generate 3D motion trajectories and reinforcement learning for policy refinement.

Authors:Zhengyuan Shi, Jingxin Wang, Wentao Jiang, Chengyu Ma, Ziyang Zheng, Zhufei Chu, Weikang Qian, Qiang Xu
Title: Alignment Unlocks Complementarity: A Framework for Multiview Circuit Representation Learning
Abstract:
Multiview learning on Boolean circuits holds immense promise, as different graph-based representations offer complementary structural and semantic information. However, the vast structural heterogeneity between views, such as an And-Inverter Graph (AIG) versus an XOR-Majority Graph (XMG), poses a critical barrier to effective fusion, especially for self-supervised techniques like masked modeling. Naively applying such methods fails, as the cross-view context is perceived as noise. Our key insight is that functional alignment is a necessary precondition to unlock the power of multiview self-supervision. We introduce MixGate, a framework built on a principled training curriculum that first teaches the model a shared, function-aware representation space via an Equivalence Alignment Loss. Only then do we introduce a multiview masked modeling objective, which can now leverage the aligned views as a rich, complementary signal. Extensive experiments, including a crucial ablation study, demonstrate that our alignment-first strategy transforms masked modeling from an ineffective technique into a powerful performance driver.
Chinese: MixGate提出了一种框架,首先通过等价对齐损失实现异构布尔电路视图间的功能对齐,从而使得多视图掩码建模能够有效利用互补信号,显著提升性能。
English: MixGate introduces a framework that first achieves functional alignment between heterogeneous Boolean circuit views through an Equivalence Alignment Loss, enabling effective multiview self-supervised learning via masked modeling that significantly boosts performance.

Authors:Yanghe Pan, Yuntao Wang, Shaolong Guo, Chengyu Yin, Ruidong Li, Zhou Su, Yuan Wu
Title: Trustworthy Semantic Communication for Vehicular Networks: Challenges and Solutions
Abstract:
Semantic communication (SemCom) has the potential to significantly reduce communication delay in vehicle-to-everything (V2X) communications within vehicular networks (VNs). However, the deployment of vehicular SemCom networks (VN-SemComNets) faces critical trust challenges in information transmission, semantic encoding, and communication entity reliability. This paper proposes an innovative three-layer trustworthy VN-SemComNet architecture. Specifically, we introduce a semantic camouflage transmission mechanism leveraging defensive adversarial noise for active eavesdropping defense, a robust federated encoder-decoder training framework to mitigate encoder-decoder poisoning attacks, and an audit game-based distributed vehicle trust management mechanism to deter untrustworthy vehicles. A case study validates the effectiveness of the proposed solutions. Lastly, essential future research directions are pointed out to advance this emerging field.
Chinese: 本文提出了一种三层可信车载语义通信网络架构,通过主动窃听防御机制、鲁棒联邦训练框架和分布式信任管理机制,有效解决了语义通信在车联网中的安全挑战。
English: This paper proposes a three-layer trustworthy semantic communication network architecture for vehicular networks, incorporating mechanisms for active eavesdropping defense, robust federated training, and distributed trust management to address security challenges.

Authors:Yuhong Zhang, Han Wang, Yiwen Wang, Rong Xie, Li Song
Title: FreeInsert: Personalized Object Insertion with Geometric and Style Control
Abstract:
Text-to-image diffusion models have made significant progress in image generation, allowing for effortless customized generation. However, existing image editing methods still face certain limitations when dealing with personalized image composition tasks. First, there is the issue of lack of geometric control over the inserted objects. Current methods are confined to 2D space and typically rely on textual instructions, making it challenging to maintain precise geometric control over the objects. Second, there is the challenge of style consistency. Existing methods often overlook the style consistency between the inserted object and the background, resulting in a lack of realism. In addition, the challenge of inserting objects into images without extensive training remains significant. To address these issues, we propose \textit{FreeInsert}, a novel training-free framework that customizes object insertion into arbitrary scenes by leveraging 3D geometric information. Benefiting from the advances in existing 3D generation models, we first convert the 2D object into 3D, perform interactive editing at the 3D level, and then re-render it into a 2D image from a specified view. This process introduces geometric controls such as shape or view. The rendered image, serving as geometric control, is combined with style and content control achieved through diffusion adapters, ultimately producing geometrically controlled, style-consistent edited images via the diffusion model.
中文: 文本到图像扩散模型在个性化图像编辑中存在几何控制和风格一致性的问题,因此提出的免训练框架FreeInsert利用三维几何信息和扩散适配器,实现了精确可控的对象插入。
English: Text-to-image diffusion models face challenges in geometric control and style consistency for personalized image editing, which the proposed training-free framework FreeInsert addresses by utilizing 3D geometric information and diffusion adapters to achieve precise object insertion.

Authors:Xiaoxiang Dong, Matthew Johnson-Roberson, Weiming Zhi
Title: Joint Flow Trajectory Optimization For Feasible Robot Motion Generation from Video Demonstrations
Abstract:
Learning from human video demonstrations offers a scalable alternative to teleoperation or kinesthetic teaching, but poses challenges for robot manipulators due to embodiment differences and joint feasibility constraints. We address this problem by proposing the Joint Flow Trajectory Optimization (JFTO) framework for grasp pose generation and object trajectory imitation under the video-based Learning-from-Demonstration (LfD) paradigm. Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides, balancing three objectives: (i) selecting a feasible grasp pose, (ii) generating object trajectories consistent with demonstrated motions, and (iii) ensuring collision-free execution within robot kinematics. To capture the multimodal nature of demonstrations, we extend flow matching to $\SE(3)$ for probabilistic modeling of object trajectories, enabling density-aware imitation that avoids mode collapse. The resulting optimization integrates grasp similarity, trajectory likelihood, and collision penalties into a unified differentiable objective. We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.
中文:JFTO框架将人类视频演示视为以物体为中心的指导,通过统一可微分目标优化抓取姿态和轨迹,确保机器人能够在满足运动可行性和避障条件下进行模仿学习。
English: The JFTO framework enables robots to learn from human video demonstrations by treating them as object-centric guides, optimizing grasp poses and trajectories while ensuring feasibility and collision avoidance through a unified differentiable objective.

Authors:Khai Nguyen, Hai Nguyen, Nhat Ho
Title: Fast Estimation of Wasserstein Distances via Regression on Sliced Wasserstein Distances
Abstract:
We address the problem of efficiently computing Wasserstein distances for multiple pairs of distributions drawn from a meta-distribution. To this end, we propose a fast estimation method based on regressing Wasserstein distance on sliced Wasserstein (SW) distances. Specifically, we leverage both standard SW distances, which provide lower bounds, and lifted SW distances, which provide upper bounds, as predictors of the true Wasserstein distance. To ensure parsimony, we introduce two linear models: an unconstrained model with a closed-form least-squares solution, and a constrained model that uses only half as many parameters. We show that accurate models can be learned from a small number of distribution pairs. Once estimated, the model can predict the Wasserstein distance for any pair of distributions via a linear combination of SW distances, making it highly efficient. Empirically, we validate our approach on diverse tasks, including Gaussian mixtures, point-cloud classification, and Wasserstein-space visualizations for 3D point clouds. Across various datasets such as MNIST point clouds, ShapeNetV2, MERFISH Cell Niches, and scRNA-seq, our method consistently provides a better approximation of Wasserstein distance than the state-of-the-art Wasserstein embedding model, Wasserstein Wormhole, particularly in low-data regimes. Finally, we demonstrate that our estimator can also accelerate Wormhole training, yielding \textit{RG-Wormhole}.
中文总结: 本文提出了一种快速估计Wasserstein距离的方法,通过将其回归到切片Wasserstein距离上,利用上下界构建高效线性模型,在多种数据集上均优于现有方法。
English Summary: This paper introduces a fast method for estimating Wasserstein distances by regressing them on sliced Wasserstein distances, using both lower and upper bounds to create efficient linear models that outperform existing approaches across various datasets.

Authors:Vivek Myers, Bill Chunyuan Zheng, Benjamin Eysenbach, Sergey Levine
Title: Offline Goal-conditioned Reinforcement Learning with Quasimetric Representations
Abstract:
Approaches for goal-conditioned reinforcement learning (GCRL) often use learned state representations to extract goal-reaching policies. Two frameworks for representation structure have yielded particularly effective GCRL algorithms: (1) *contrastive representations*, in which methods learn "successor features" with a contrastive objective that performs inference over future outcomes, and (2) *temporal distances*, which link the (quasimetric) distance in representation space to the transit time from states to goals. We propose an approach that unifies these two frameworks, using the structure of a quasimetric representation space (triangle inequality) with the right additional constraints to learn successor representations that enable optimal goal-reaching. Unlike past work, our approach is able to exploit a **quasimetric** distance parameterization to learn **optimal** goal-reaching distances, even with **suboptimal** data and in **stochastic** environments. This gives us the best of both worlds: we retain the stability and long-horizon capabilities of Monte Carlo contrastive RL methods, while getting the free stitching capabilities of quasimetric network parameterizations. On existing offline GCRL benchmarks, our representation learning objective improves performance on stitching tasks where methods based on contrastive learning struggle, and on noisy, high-dimensional environments where methods based on quasimetric networks struggle.
中文摘要:本文提出一种统一目标条件强化学习的方法,将对比表征与时间距离相结合来学习最优目标达成策略,在拼接任务和噪声环境中的表现优于现有方法。
English Summary: This paper introduces a unified approach for goal-conditioned reinforcement learning that combines contrastive representations and temporal distances to learn optimal goal-reaching policies, demonstrating superior performance in stitching tasks and noisy environments compared to existing methods.

Authors:Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu
Title: PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
Abstract:
Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl
中文:PhysCtrl是一种基于物理学的创新框架,通过基于合成动画训练的扩散模型学习物理动力学,生成逼真且可控的视频,在视觉质量和物理合理性方面均优于现有方法。
English: PhysCtrl is a novel physics-grounded framework that generates realistic and controllable videos by learning physical dynamics through a diffusion model trained on synthetic animations, outperforming existing methods in both visual quality and physical plausibility.

Authors:Zipeng Ling, Yuehao Tang, Chen Huang, Shuliang Liu, Gaoyang Jiang, Shenghong Fu, Junqi Yang, Yao Wan, Jiawan Zhang, Kejia Huang, Xuming Hu
Title: Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage
Abstract:
Large-language-model (LLM) reasoning has long been regarded as a powerful tool for problem solving across domains, providing non-experts with valuable advice. However, their limitations - especially those stemming from prompt design - remain underexplored. Because users may supply biased or incomplete prompts - often unintentionally - LLMs can be misled, undermining reliability and creating risks. We refer to this vulnerability as the Instruction Boundary. To investigate the phenomenon, we distill it into eight concrete facets and introduce BiasDetector, a framework that measures biases arising from three instruction types: complete, redundant, and insufficient. We evaluate several mainstream LLMs and find that, despite high headline accuracy, substantial biases persist in many downstream tasks as a direct consequence of prompt coverage. Our empirical study confirms that LLM reasoning reliability can still be significantly improved. We analyze the practical impact of these biases and outline mitigation strategies. Our findings underscore the need for developers to tackle biases and for users to craft options carefully.
中文摘要:本研究提出"指令边界"概念,分析不同提示覆盖度如何导致大语言模型产生推理偏差,并通过BiasDetector框架量化模型识别稀疏标签的能力,发现即使准确率很高,下游任务中仍存在显著偏差。
English Summary: This study introduces the concept of Instruction Boundary to analyze how varying prompt coverage affects LLM reasoning biases, proposing BiasDetector to quantify their ability to identify flawed data patterns called sparse labels, revealing persistent biases despite high accuracy.

Authors:Zipeng Ling, Yuehao Tang, Chen Huang, Shuliang Liu, Gaoyang Jiang, Shenghong Fu, Junqi Yang, Yao Wan, Jiawan Zhang, Kejia Huang, Xuming Hu
Title: Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage
Abstract:
Nowadays, automatically generated datasets are increasingly used in LLM reasoning tasks; however, large-scale corpora often contain inherent flaws. For example, a single-choice question may include none or multiple correct options, while true-or-false questions may involve vague or unverifiable statements. We refer to these exceptional answer forms as sparse labels. To compare LLMs' ability to recognize various question forms and produce correct answers, we investigate how different instruction formats can either facilitate or mislead LLM reasoning ability. We introduce the concept of Instruction Boundary, which systematically analyzes how different levels of prompt coverage -- sufficient, redundant, or insufficient -- can lead to reasoning biases and performance changes in LLMs. To examine this phenomenon, we design eight experimental settings across five dataset forms. We further propose BiasDetector, a unified framework that quantifies LLMs' ability to identify sparse labels under different kinds of Instruction Boundary conditions. Evaluations on five mainstream LLMs show that, despite their seemingly high accuracy, substantial reasoning biases persist in many downstream tasks as a direct consequence of prompt coverage. We analyze the impact of these biases and outline possible mitigation strategies. Our findings highlight not only the importance of addressing sparse labels, but also the need for developers to recognize and mitigate the risks introduced by Instruction Boundary.
中文摘要:本研究提出"指令边界"概念,分析不同提示覆盖度如何导致大语言模型产生推理偏差,并通过BiasDetector框架量化模型识别稀疏标签的能力,发现即使准确率很高,下游任务中仍存在显著偏差。
English Summary: This study introduces the concept of Instruction Boundary to analyze how varying prompt coverage affects LLM reasoning biases, proposing BiasDetector to quantify their ability to identify flawed data patterns called sparse labels, revealing persistent biases despite high accuracy.

Authors:Ziyi Han, Xutong Liu, Ruiting Zhou, Xiangxiang Dai, John C. S. Lui
Title: Faster, Smaller, and Smarter: Task-Aware Expert Merging for Online MoE Inference
Abstract:
Sparse Mixture of Experts (SMoE) has become a preferred architecture for scaling Transformer capacity without increasing computational cost, as it activates only a small subset of experts for each input. However, deploying such an approach for \textit{online inference} remains challenging due to the large size of a full SMoE model and the complexity of expert routing, especially in resource-constrained edge networks. Moreover, during the online inference, task information is often unavailable, making the task-level routing error-prone. In this work, we propose a novel tree-structured adaptive neural bandit router, \texttt{Tanbr}, to enable efficient and reliable online MoE inference. Instead of relying on explicit task tags, \texttt{Tanbr} estimates the task distribution over time from historical data and uses it to guide task-aware expert merging within a given pre-trained MoE. To handle the large continuous space of merging weights, \texttt{Tanbr} employs a binary tree to progressively partition the space and generate finer candidate weights. It then applies a neural bandit to learn the non-linear mapping from merging weight to model performance and decides optimal expert merging. We prove that \texttt{Tanbr} achieves a sublinear regret bound of {\small $\mathcal{O}(\sqrt{T} \log(T))$} over {\small $T$} rounds, despite operating over a continuous decision space, matching regret bounds compared to existing methods. Extensive experiments show that \texttt{Tanbr} reduces inference latency by at least {\small $45\%$} and memory usage by up to {\small $25\%$}, while maintaining a high accuracy compared to many state-of-the-art methods.
中文: 提出的Tanbr路由器通过基于估计任务分布自适应合并专家,实现了稀疏专家混合模型的高效在线推理,在保持高精度的同时显著降低了延迟和内存使用。
English: The proposed Tanbr router enables efficient online inference for Sparse Mixture of Experts models by adaptively merging experts based on estimated task distributions, achieving significant reductions in latency and memory usage while maintaining high accuracy.

Authors:Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Tom Stapleford, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh V. Chawla
Title: LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines
Abstract:
Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.
中文: 本文综述了前沿大语言模型及其在多学科领域的整合应用,探讨了它们对研究和实践的变革性影响,同时分析了生成式人工智能时代的关键挑战与未来方向。
English: This paper provides a comprehensive overview of state-of-the-art Large Language Models (LLMs) and their integration across diverse academic disciplines, exploring their transformative potential in research and practice while addressing limitations and future directions in the generative AI era.

Authors:Saeed Almheiri, Rania Hossam, Mena Attia, Chenxi Wang, Preslav Nakov, Timothy Baldwin, Fajri Koto
Title: Cross-Cultural Transfer of Commonsense Reasoning in LLMs: Evidence from the Arab World
Abstract:
Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning and direct preference optimization. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10\% on average, within multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesia and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapt LLMs to low-resource cultural settings.
Chinese: 研究表明,仅使用来自某个阿拉伯国家的12个文化特定示例,轻量级对齐方法就能将其他阿拉伯国家的常识推理性能平均提升10%,这种跨文化可转移性甚至延伸至印度尼西亚和美国等非阿拉伯语境。
English: This study demonstrates that lightweight alignment methods using just 12 culture-specific examples from one Arab country can boost commonsense reasoning performance in other Arab countries by 10% on average, revealing cross-cultural transferability that extends even to non-Arab contexts like Indonesia and the US.

Authors:Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
Title: An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Abstract:
Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
中文: 基于基础模型的AI代理因非确定性面临测试挑战,研究发现传统方法占主导,而关键组件如提示语测试严重不足,亟需改进测试实践以增强可靠性。
English: Foundation model-based AI agents face testing challenges due to their non-determinism, with a study revealing that traditional methods dominate while critical components like prompts remain largely untested, highlighting a need for improved testing practices.

Authors:Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Title: CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration
Abstract:
Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/{this URL}.
中文:CARINOX是一个统一框架,结合噪声优化与探索以及基于人类判断相关性的奖励选择原则,有效提升了文本到图像扩散模型的组合对齐能力,在基准测试中显著提高了对齐分数,同时保持了图像质量和多样性。
English: CARINOX is a unified framework that combines noise optimization and exploration with a principled reward selection to enhance compositional alignment in text-to-image diffusion models, achieving significant improvements in alignment scores on benchmarks while preserving image quality and diversity.

Authors:Xiwei Zhao, Yiwei Wang, Yansong Wu, Fan Wu, Teng Sun, Zhonghua Miao, Sami Haddadin, Alois Knoll
Title: Video-to-BT: Generating Reactive Behavior Trees from Human Demonstration Videos for Robotic Assembly
Abstract:
Modern manufacturing demands robotic assembly systems with enhanced flexibility and reliability. However, traditional approaches often rely on programming tailored to each product by experts for fixed settings, which are inherently inflexible to product changes and lack the robustness to handle variations. As Behavior Trees (BTs) are increasingly used in robotics for their modularity and reactivity, we propose a novel hierarchical framework, Video-to-BT, that seamlessly integrates high-level cognitive planning with low-level reactive control, with BTs serving both as the structured output of planning and as the governing structure for execution. Our approach leverages a Vision-Language Model (VLM) to decompose human demonstration videos into subtasks, from which Behavior Trees are generated. During the execution, the planned BTs combined with real-time scene interpretation enable the system to operate reactively in the dynamic environment, while VLM-driven replanning is triggered upon execution failure. This closed-loop architecture ensures stability and adaptivity. We validate our framework on real-world assembly tasks through a series of experiments, demonstrating high planning reliability, robust performance in long-horizon assembly tasks, and strong generalization across diverse and perturbed conditions. Project website: https://video2bt.github.io/video2bt_page/
中文摘要:提出的Video-to-BT框架通过视觉语言模型将装配演示转化为行为树,实现闭环规划与执行的机器人反应式控制,能自适应动态环境变化。
English Summary: The proposed Video-to-BT framework uses vision-language models to convert assembly demonstrations into behavior trees, enabling reactive robotic control that adapts to dynamic environments through closed-loop planning and execution.

Authors:Mingdong Wu, Long Yang, Jin Liu, Weiyao Huang, Lehong Wu, Zelin Chen, Daolin Ma, Hao Dong
Title: UniTac2Pose: A Unified Approach Learned in Simulation for Category-level Visuotactile In-hand Pose Estimation
Abstract:
Accurate estimation of the in-hand pose of an object based on its CAD model is crucial in both industrial applications and everyday tasks, ranging from positioning workpieces and assembling components to seamlessly inserting devices like USB connectors. While existing methods often rely on regression, feature matching, or registration techniques, achieving high precision and generalizability to unseen CAD models remains a significant challenge. In this paper, we propose a novel three-stage framework for in-hand pose estimation. The first stage involves sampling and pre-ranking pose candidates, followed by iterative refinement of these candidates in the second stage. In the final stage, post-ranking is applied to identify the most likely pose candidates. These stages are governed by a unified energy-based diffusion model, which is trained solely on simulated data. This energy model simultaneously generates gradients to refine pose estimates and produces an energy scalar that quantifies the quality of the pose estimates. Additionally, borrowing the idea from the computer vision domain, we incorporate a render-compare architecture within the energy-based score network to significantly enhance sim-to-real performance, as demonstrated by our ablation studies. We conduct comprehensive experiments to show that our method outperforms conventional baselines based on regression, matching, and registration techniques, while also exhibiting strong intra-category generalization to previously unseen CAD models. Moreover, our approach integrates tactile object pose estimation, pose tracking, and uncertainty estimation into a unified framework, enabling robust performance across a variety of real-world conditions.
中文摘要:本文提出了一种基于能量扩散模型的三阶段手内物体姿态估计框架,该框架仅使用模拟数据训练,在精度和泛化能力上超越传统方法,并能将触觉姿态估计、姿态跟踪和不确定性评估统一集成。
English Summary: This paper introduces a novel three-stage framework for in-hand object pose estimation using a unified energy-based diffusion model trained on simulated data, which outperforms existing methods and demonstrates strong generalization to unseen CAD models while integrating multiple functionalities into a unified system.

Authors:Yi Dong, Zhongguo Li, Sarvapali D. Ramchurn, Xiaowei Huang
Title: Distributed Nash Equilibrium Seeking Algorithm in Aggregative Games for Heterogeneous Multi-Robot Systems
Abstract:
This paper develops a distributed Nash Equilibrium seeking algorithm for heterogeneous multi-robot systems. The algorithm utilises distributed optimisation and output control to achieve the Nash equilibrium by leveraging information shared among neighbouring robots. Specifically, we propose a distributed optimisation algorithm that calculates the Nash equilibrium as a tailored reference for each robot and designs output control laws for heterogeneous multi-robot systems to track it in an aggregative game. We prove that our algorithm is guaranteed to converge and result in efficient outcomes. The effectiveness of our approach is demonstrated through numerical simulations and empirical testing with physical robots.
中文: 本文针对异构多机器人系统提出了一种分布式纳什均衡搜索算法,通过分布式优化和输出控制相结合的方法确保收敛性和效率,并经过仿真与实物机器人实验验证了有效性。
English: This paper presents a distributed Nash Equilibrium seeking algorithm for heterogeneous multi-robot systems, combining distributed optimization and output control to ensure convergence and efficiency, as validated through simulations and physical experiments.

Authors:Simon Welker, Tal Peer, Timo Gerkmann
Title: Real-Time Streaming Mel Vocoding with Generative Flow Matching
Abstract:
The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank, we develop MelFlow, a streaming-capable generative Mel vocoder for speech sampled at 16 kHz with an algorithmic latency of only 32 ms and a total latency of 48 ms. We show real-time streaming capability at this latency not only in theory, but in practice on a consumer laptop GPU. Furthermore, we show that our model achieves substantially better PESQ and SI-SDR values compared to well-established not streaming-capable baselines for Mel vocoding including HiFi-GAN.
中文:MelFlow是一种低延迟生成式声码器,能够实时将梅尔频谱图转换为音频,并在质量和效率上均优于现有模型。
English: MelFlow is a low-latency generative vocoder that enables real-time streaming conversion of Mel spectrograms to audio, outperforming existing models in both quality and efficiency.

Authors:Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng
Title: Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Delibration
Abstract:
Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.
中文摘要:本研究提出Align3方法,通过测试时审议帮助大语言模型适应不同场景下的动态行为与安全规范,并推出SpecBench基准,证明该方法能以最小成本有效提升规范对齐能力。
English Summary: The study introduces Align3, a lightweight method using test-time deliberation to help large language models adapt to dynamic behavioral and safety specifications across various scenarios, and presents SpecBench, a benchmark demonstrating its effectiveness in improving specification alignment with minimal overhead.

Authors:Ahmed Sheta, Mathias Zinnen, Aline Sindel, Andreas Maier, Vincent Christlein
Title: Data Augmentation via Latent Diffusion Models for Detecting Smell-Related Objects in Historical Artworks
Abstract:
Finding smell references in historic artworks is a challenging problem. Beyond artwork-specific challenges such as stylistic variations, their recognition demands exceptionally detailed annotation classes, resulting in annotation sparsity and extreme class imbalance. In this work, we explore the potential of synthetic data generation to alleviate these issues and enable accurate detection of smell-related objects. We evaluate several diffusion-based augmentation strategies and demonstrate that incorporating synthetic data into model training can improve detection performance. Our findings suggest that leveraging the large-scale pretraining of diffusion models offers a promising approach for improving detection accuracy, particularly in niche applications where annotations are scarce and costly to obtain. Furthermore, the proposed approach proves to be effective even with relatively small amounts of data, and scaling it up provides high potential for further enhancements.
中文摘要:本研究证明,利用扩散模型生成合成数据能有效缓解历史艺术品中气味相关物体检测的标注稀疏和类别不平衡问题,显著提升检测性能。
English Summary: This study demonstrates that using diffusion models to generate synthetic data significantly improves the detection of smell-related objects in historical artworks by overcoming annotation scarcity and class imbalance.

Authors:Elena Camuffo, Francesco Barbato, Mete Ozay, Simone Milani, Umberto Michieli
Title: MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
Abstract:
We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
中文: MOCHA是一种知识蒸馏方法,将大型教师模型中的物体级多模态知识迁移至轻量级纯视觉学生检测器,无需推理时的文本输入即可实现显著的性能提升。
English: MOCHA is a knowledge distillation method that transfers object-level multimodal knowledge from a large teacher model to a lightweight vision-only student detector, achieving significant performance gains without requiring text input during inference.

Authors:Boyu Zhang, Ping He, Tianyu Du, Xuhong Zhang, Lei Yun, Kingsum Chow, Jianwei Yin
Title: CLMTracing: Black-box User-level Watermarking for Code Language Model Tracing
Abstract:
With the widespread adoption of open-source code language models (code LMs), intellectual property (IP) protection has become an increasingly critical concern. While current watermarking techniques have the potential to identify the code LM to protect its IP, they have limitations when facing the more practical and complex demand, i.e., offering the individual user-level tracing in the black-box setting. This work presents CLMTracing, a black-box code LM watermarking framework employing the rule-based watermarks and utility-preserving injection method for user-level model tracing. CLMTracing further incorporates a parameter selection algorithm sensitive to the robust watermark and adversarial training to enhance the robustness against watermark removal attacks. Comprehensive evaluations demonstrate CLMTracing is effective across multiple state-of-the-art (SOTA) code LMs, showing significant harmless improvements compared to existing SOTA baselines and strong robustness against various removal attacks.
Chinese: 本文提出了CLMTracing,一种采用基于规则水印和保持实用性注入方法的黑盒代码语言模型水印框架,通过参数选择和对抗训练增强抗移除攻击的鲁棒性,实现用户级追踪。
English: This paper introduces CLMTracing, a black-box watermarking framework that enables user-level tracing for code language models by using rule-based watermarks and utility-preserving injection, while enhancing robustness through parameter selection and adversarial training against removal attacks.

Authors:Ayberk Acar, Fangjie Li, Hao Li, Lidia Al-Zogbi, Kanyifeechukwu Jane Oguine, Susheela Sharma Stern, Jesse F. d'Almeida, Robert J. Webster, Ipek Oguz, Jie Ying Wu
Title: Semantic 3D Reconstructions with SLAM for Central Airway Obstruction
Abstract:
Central airway obstruction (CAO) is a life-threatening condition with increasing incidence, caused by tumors in and outside of the airway. Traditional treatment methods such as bronchoscopy and electrocautery can be used to remove the tumor completely; however, these methods carry a high risk of complications. Recent advances allow robotic interventions with lesser risk. The combination of robot interventions with scene understanding and mapping also opens up the possibilities for automation. We present a novel pipeline that enables real-time, semantically informed 3D reconstructions of the central airway using monocular endoscopic video. Our approach combines DROID-SLAM with a segmentation model trained to identify obstructive tissues. The SLAM module reconstructs the 3D geometry of the airway in real time, while the segmentation masks guide the annotation of obstruction regions within the reconstructed point cloud. To validate our pipeline, we evaluate the reconstruction quality using ex vivo models. Qualitative and quantitative results show high similarity between ground truth CT scans and the 3D reconstructions (0.62 mm Chamfer distance). By integrating segmentation directly into the SLAM workflow, our system produces annotated 3D maps that highlight clinically relevant regions in real time. High-speed capabilities of the pipeline allows quicker reconstructions compared to previous work, reflecting the surgical scene more accurately. To the best of our knowledge, this is the first work to integrate semantic segmentation with real-time monocular SLAM for endoscopic CAO scenarios. Our framework is modular and can generalize to other anatomies or procedures with minimal changes, offering a promising step toward autonomous robotic interventions.
中央气道阻塞是一种严重的疾病,越来越多地采用机器人方法治疗,本研究提出了一种实时3D重建系统,通过集成语义分割在内窥镜手术中精确定位阻塞区域,并经CT扫描验证具有高精度。
Central airway obstruction is a serious condition increasingly treated with robotic methods, and this study introduces a real-time 3D reconstruction system that integrates semantic segmentation to precisely identify obstructions during endoscopic procedures, validated by high accuracy compared to CT scans.

Authors:Hao Li, Hicham Masri, Filipe R. Cogo, Abdul Ali Bangash, Bram Adams, Ahmed E. Hassan
Title: Understanding Prompt Management in GitHub Repositories: A Call for Best Practices
Abstract:
The rapid adoption of foundation models (e.g., large language models) has given rise to promptware, i.e., software built using natural language prompts. Effective management of prompts, such as organization and quality assurance, is essential yet challenging. In this study, we perform an empirical analysis of 24,800 open-source prompts from 92 GitHub repositories to investigate prompt management practices and quality attributes. Our findings reveal critical challenges such as considerable inconsistencies in prompt formatting, substantial internal and external prompt duplication, and frequent readability and spelling issues. Based on these findings, we provide actionable recommendations for developers to enhance the usability and maintainability of open-source prompts within the rapidly evolving promptware ecosystem.
中文: 本研究通过对24,800个开源提示符的实证分析,揭示了格式不一致、重复及可读性差等关键管理挑战,并为提升提示符可用性和可维护性提供了实用建议。
English: This empirical study of 24,800 open-source prompts reveals significant management challenges including formatting inconsistencies, duplication, and readability issues, offering actionable recommendations to improve prompt usability and maintainability.

Authors:Jianping Li, Kaisong Zhu, Zhongyuan Liu, Rui Jin, Xinhang Xu, Pengfei Wan, Lihua Xie
Title: Adaptive Motorized LiDAR Scanning Control for Robust Localization with OpenStreetMap
Abstract:
LiDAR-to-OpenStreetMap (OSM) localization has gained increasing attention, as OSM provides lightweight global priors such as building footprints. These priors enhance global consistency for robot navigation, but OSM is often incomplete or outdated, limiting its reliability in real-world deployment. Meanwhile, LiDAR itself suffers from a limited field of view (FoV), where motorized rotation is commonly used to achieve panoramic coverage. Existing motorized LiDAR systems, however, typically employ constant-speed scanning that disregards both scene structure and map priors, leading to wasted effort in feature-sparse regions and degraded localization accuracy. To address these challenges, we propose Adaptive LiDAR Scanning with OSM guidance, a framework that integrates global priors with local observability prediction to improve localization robustness. Specifically, we augment uncertainty-aware model predictive control with an OSM-aware term that adaptively allocates scanning effort according to both scene-dependent observability and the spatial distribution of OSM features. The method is implemented in ROS with a motorized LiDAR odometry backend and evaluated in both simulation and real-world experiments. Results on campus roads, indoor corridors, and urban environments demonstrate significant reductions in trajectory error compared to constant-speed baselines, while maintaining scan completeness. These findings highlight the potential of coupling open-source maps with adaptive LiDAR scanning to achieve robust and efficient localization in complex environments.
中文: 本研究提出了一种基于OpenStreetMap引导的自适应激光雷达扫描框架,通过结合场景可观测性与地图特征动态分配扫描资源,在多种环境中显著降低了轨迹误差。
English: This study introduces an adaptive LiDAR scanning framework guided by OpenStreetMap to enhance robot localization by dynamically allocating scanning effort based on scene observability and map features, significantly reducing trajectory errors in diverse environments.

Authors:Pat Pataranutaporn, Sheer Karny, Chayapatr Archiwaranguprok, Constanze Albrecht, Auren R. Liu, Pattie Maes
Title: "My Boyfriend is AI": A Computational Analysis of Human-AI Companionship in Reddit's AI Community
Abstract:
The emergence of AI companion applications has created novel forms of intimate human-AI relationships, yet empirical research on these communities remains limited. We present the first large-scale computational analysis of r/MyBoyfriendIsAI, Reddit's primary AI companion community (27,000+ members). Using exploratory qualitative analysis and quantitative analysis employing classifiers, we identify six primary conversation themes, with visual sharing of couple pictures and ChatGPT-specific discussions dominating the discourse of the most viewed posts. Through analyzing the top posts in the community, our findings reveal how community members' AI companionship emerges unintentionally through functional use rather than deliberate seeking, with users reporting therapeutic benefits led by reduced loneliness, always-available support, and mental health improvements. Our work covers primary concerns about human intimacy with AIs such as emotional dependency, reality dissociation, and grief from model updates. We observe users materializing relationships following traditional human-human relationship customs, such as wedding rings. Community dynamics indicate active resistance to stigmatization through advocacy and mutual validation. This work contributes an empirical understanding of AI companionship as an emerging sociotechnical phenomenon.
中文摘要:本研究首次对人工智能伴侣社区进行大规模分析,揭示了用户如何通过功能性互动无意间形成具有治疗效果的陪伴关系,同时应对情感依赖和现实脱节等主要关切。
English Summary: This study provides the first large-scale analysis of an AI companion community, revealing how users unintentionally form therapeutic relationships through functional interactions while navigating concerns like emotional dependency and reality dissociation.

Authors:Wan Xu, Feng Zhu, Yihan Zeng, Yuanfan Guo, Ming Liu, Hang Xu, Wangmeng Zuo
Title: GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration
Abstract:
Video detailed captioning aims to generate comprehensive video descriptions to facilitate video understanding. Recently, most efforts in the video detailed captioning community have been made towards a local-to-global paradigm, which first generates local captions from video clips and then summarizes them into a global caption. However, we find this paradigm leads to less detailed and contextual-inconsistent captions, which can be attributed to (1) no mechanism to ensure fine-grained captions, and (2) weak interaction between local and global captions. To remedy the above two issues, we propose GLaVE-Cap, a Global-Local aligned framework with Vision Expert integration for Captioning, which consists of two core modules: TrackFusion enables comprehensive local caption generation, by leveraging vision experts to acquire cross-frame visual prompts, coupled with a dual-stream structure; while CaptionBridge establishes a local-global interaction, by using global context to guide local captioning, and adaptively summarizing local captions into a coherent global caption. Besides, we construct GLaVE-Bench, a comprehensive video captioning benchmark featuring 5X more queries per video than existing benchmarks, covering diverse visual dimensions to facilitate reliable evaluation. We further provide a training dataset GLaVE-1.2M containing 16K high-quality fine-grained video captions and 1.2M related question-answer pairs. Extensive experiments on four benchmarks show that our GLaVE-Cap achieves state-of-the-art performance. Besides, the ablation studies and student model analyses further validate the effectiveness of the proposed modules and the contribution of GLaVE-1.2M to the video understanding community. The source code, model weights, benchmark, and dataset will be open-sourced.
中文:提出的GLaVE-Cap框架通过整合跨帧视觉提示生成细粒度局部描述,并建立局部与全局的双向交互机制,有效解决了现有视频详细描述方法存在的细节缺失和上下文不一致问题,在多个基准测试中达到最优性能。
English: The proposed GLaVE-Cap framework addresses limitations in video detailed captioning by integrating cross-frame visual prompts for fine-grained local descriptions and establishing bidirectional local-global interactions, achieving state-of-the-art performance across multiple benchmarks.

Authors:Matthias Wüest, Francis Engelmann, Ondrej Miksik, Marc Pollefeys, Daniel Barath
Title: UnLoc: Leveraging Depth Uncertainties for Floorplan Localization
Abstract:
We propose UnLoc, an efficient data-driven solution for sequential camera localization within floorplans. Floorplan data is readily available, long-term persistent, and robust to changes in visual appearance. We address key limitations of recent methods, such as the lack of uncertainty modeling in depth predictions and the necessity for custom depth networks trained for each environment. We introduce a novel probabilistic model that incorporates uncertainty estimation, modeling depth predictions as explicit probability distributions. By leveraging off-the-shelf pre-trained monocular depth models, we eliminate the need to rely on per-environment-trained depth networks, enhancing generalization to unseen spaces. We evaluate UnLoc on large-scale synthetic and real-world datasets, demonstrating significant improvements over existing methods in terms of accuracy and robustness. Notably, we achieve $2.7$ times higher localization recall on long sequences (100 frames) and $16.7$ times higher on short ones (15 frames) than the state of the art on the challenging LaMAR HGE dataset.
Chinese: UnLoc是一种利用平面图进行序列相机定位的高效数据驱动方法,它通过引入带有不确定性估计的概率模型并利用预训练深度模型,显著提升了泛化能力和定位精度,优于现有技术。
English: UnLoc is an efficient data-driven method for sequential camera localization using floorplans, introducing a probabilistic model with uncertainty estimation and leveraging pre-trained depth models to improve generalization and achieve significant accuracy gains over existing techniques.

Authors:Andrea Tonini, Tan Bui-Thanh, Francesco Regazzoni, Luca Dede', Alfio Quarteroni
Title: Improvements on uncertainty quantification with variational autoencoders
Abstract:
Inverse problems aim to determine model parameters of a mathematical problem from given observational data. Neural networks can provide an efficient tool to solve these problems. In the context of Bayesian inverse problems, Uncertainty Quantification Variational AutoEncoders (UQ-VAE), a class of neural networks, approximate the posterior distribution mean and covariance of model parameters. This allows for both the estimation of the parameters and their uncertainty in relation to the observational data. In this work, we propose a novel loss function for training UQ-VAEs, which includes, among other modifications, the removal of a sample mean term from an already existing one. This modification improves the accuracy of UQ-VAEs, as the original theoretical result relies on the convergence of the sample mean to the expected value (a condition that, in high dimensional parameter spaces, requires a prohibitively large number of samples due to the curse of dimensionality). Avoiding the computation of the sample mean significantly reduces the training time in high dimensional parameter spaces compared to previous literature results. Under this new formulation, we establish a new theoretical result for the approximation of the posterior mean and covariance for general mathematical problems. We validate the effectiveness of UQ-VAEs through three benchmark numerical tests: a Poisson inverse problem, a non affine inverse problem and a 0D cardiocirculatory model, under the two clinical scenarios of systemic hypertension and ventricular septal defect. For the latter case, we perform forward uncertainty quantification.
中文摘要:提出了一种新的UQ-VAE训练损失函数,通过移除样本均值项,在提高高维贝叶斯反问题计算精度的同时显著减少训练时间,并为后验分布估计建立了新的理论框架。
English Summary: A new loss function for training UQ-VAEs is proposed that eliminates the sample mean term, enhancing both accuracy and computational efficiency in high-dimensional Bayesian inverse problems while establishing new theoretical guarantees for posterior estimation.

Authors:Tao Han, Wanghan Xu, Junchao Gong, Xiaoyu Yue, Song Guo, Luping Zhou, Lei Bai
Title: InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis
Abstract:
Arbitrary resolution image generation provides a consistent visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays over 100 seconds. To solve this, we explore the second generation upon the latent diffusion models, where the fixed latent generated by diffusion models is regarded as the content representation and we propose to decode arbitrary resolution images with a compact generated latent using a one-step generator. Thus, we present the \textbf{InfGen}, replacing the VAE decoder with the new generator, for generating images at any resolution from a fixed-size latent without retraining the diffusion models, which simplifies the process, reducing computational complexity and can be applied to any model using the same latent space. Experiments show InfGen is capable of improving many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.
Chinese Summary: InfGen提出了一种新型生成器,替代潜在扩散模型中的VAE解码器,能够从固定大小的潜在表示生成任意分辨率的图像而无需重新训练,从而降低计算复杂度并将4K图像生成时间缩短至10秒以内。
English Summary: InfGen introduces a novel generator that replaces the VAE decoder in latent diffusion models, enabling arbitrary-resolution image generation from a fixed-size latent without retraining, which reduces computational complexity and cuts 4K image generation time to under 10 seconds.

Authors:James Jewitt, Hao Li, Bram Adams, Gopi Krishnan Rajbahadur, Ahmed E. Hassan
Title: From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem
Abstract:
Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance in which 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes almost 200 SPDX and model-specific clauses for detecting license conflicts, which can solve 86.4% of license conflicts in software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale.
中文摘要:开源AI生态中的隐性许可冲突带来严重法律风险,审计发现35.5%的模型集成存在违规重新授权行为,同时原型工具可解决86.4%的许可冲突问题。
English Summary: Hidden license conflicts in open-source AI systems create significant legal risks, with our audit revealing that 35.5% of model integrations improperly relicense restrictive terms and demonstrating a prototype tool that resolves 86.4% of these conflicts.

Authors:Sirui Xu, Yu-Wei Chao, Liuyu Bian, Arsalan Mousavian, Yu-Xiong Wang, Liang-Yan Gui, Wei Yang
Title: Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration
Abstract:
Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for scaling dexterous robotic manipulation. Yet demonstration inaccuracies and embodiment gaps between human and robot hands limit the straightforward use of these data. Existing methods adopt a three-stage workflow, including retargeting, tracking, and residual correction, which often leaves demonstrations underused and compound errors across stages. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale. Rather than treating demonstrations as ground truth, we use them as soft guidance. From raw trajectories, we derive adaptive spatial scopes, and train with reinforcement learning to keep the policy in-scope while minimizing control effort and accomplishing the task. This unified formulation preserves demonstration intent, enables robot-specific strategies to emerge, improves robustness to noise, and scales to large demonstration corpora. We distill the scaled tracking policy into a vision-based, skill-conditioned generative controller that encodes diverse manipulation skills in a rich latent representation, supporting generalization across objects and real-world deployment. Taken together, these contributions position Dexplore as a principled bridge that transforms imperfect demonstrations into effective training signals for dexterous manipulation.
中文: Dexplore提出了一种统一的优化框架,通过将动作捕捉数据视为柔性指导而非绝对标准,结合强化学习直接学习机器人控制策略,既保留演示意图又催生机器人专属操作方案。
English: Dexplore introduces a unified optimization framework that directly learns robot control policies from motion-capture data by treating demonstrations as soft guidance and using reinforcement learning to preserve intent while enabling robot-specific strategies.

Authors:Qinnan Hu, Yuntao Wang, Yuan Gao, Zhou Su, Linkang Du
Title: Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and Solutions
Abstract:
Large language models (LLMs)-empowered autonomous agents are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss the future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.
中文: 本文提出一种基于区块链的分层监管架构,通过行为追踪、信誉评估和恶意行为预测模块,为多智能体系统建立可信、可扩展的监管机制。
English: This paper introduces a blockchain-based layered framework to govern autonomous agents by enabling behavior tracking, reputation evaluation, and malicious activity prediction for trustworthy multi-agent collaboration.

Authors:Wei Guo, Maura Pintor, Ambra Demontis, Battista Biggio
Title: Prototype-Guided Robust Learning against Backdoor Attacks
Abstract:
Backdoor attacks poison the training data to embed a backdoor in the model, causing it to behave normally on legitimate inputs but maliciously when specific trigger signals appear. Training a benign model from a dataset poisoned by backdoor attacks is challenging. Existing works rely on various assumptions and can only defend against backdoor attacks with specific trigger signals, high poisoning ratios, or when the defender possesses a large, untainted, validation dataset. In this paper, we propose a defense called Prototype-Guided Robust Learning (PGRL), which overcomes all the aforementioned limitations, being robust against diverse backdoor attacks. Leveraging a tiny set of benign samples, PGRL generates prototype vectors to guide the training process. We compare our PGRL with 8 existing defenses, showing that it achieves superior robustness. We also demonstrate that PGRL generalizes well across various architectures, datasets, and advanced attacks. Finally, to evaluate our PGRL in the worst-case scenario, we perform an adaptive attack, where the attackers fully know the details of the defense.
Chinese: 本文提出的原型引导鲁棒学习(PGRL)方法通过少量良性样本生成原型向量,有效抵御多种后门攻击,克服了现有防御手段的局限,并在不同架构、数据集和攻击场景下展现出卓越的鲁棒性。
English: This paper introduces Prototype-Guided Robust Learning (PGRL), a defense method that effectively counters diverse backdoor attacks by using a small set of benign samples to generate prototype vectors, overcoming the limitations of existing approaches and demonstrating superior robustness across various architectures, datasets, and attacks.

Authors:Wei Guo, Maura Pintor, Ambra Demontis, Battista Biggio
Title: Silent Until Sparse: Backdoor Attacks on Semi-Structured Sparsity
Abstract:
In the deployment phase, semi-structured sparsity accelerates the execution of deep neural networks on modern GPUs via sparse matrix multiplication. In this paper, targeting the semi-structured sparsity, we introduce a Silent Until Sparse (SUS) backdoor attack, where the released full model remains silent (benign), but becomes a backdoored model after sparsification. The attack operates in two phases: (i) in the backdoor training phase, the backdoor functionality is injected into specific weights that will be retained during the pruning process; (ii) in the backdoor hiding phase, the malicious behavior is concealed by fine-tuning elements that will be pruned away. This dual-phase approach ensures that the attack remains undetectable in the released model, but activates properly once the model is pruned with the semi-structured sparsity. Through extensive experiments, we show that our attack successfully threatens the semi-structured sparsity algorithms from both NVIDIA and PyTorch. Our empirical results show that, regardless of model architecture, the attack success rate of the released model remains below 10% prior to sparsification but exceeds 99% afterward. Moreover, we demonstrate that SUS attack is robust against state-of-the-art backdoor defenses and finetuning, highlighting a critical vulnerability in current model compression and deployment pipelines.
中文摘要:SUS后门攻击针对半结构化稀疏性,使模型在剪枝前保持良性,剪枝后激活恶意功能,攻击成功率从不足10%跃升至超过99%,且能规避现有防御措施。
English Summary: The SUS backdoor attack targets semi-structured sparsity by creating models that remain benign until pruned, achieving over 99% attack success post-sparsification while evading detection in the full model.

Authors:Mikel Robredo, Matteo Esposito, Fabio Palomba, Rafael Peñaloza, Valentina Lenarduzzi
Title: What Were You Thinking? An LLM-Driven Large-Scale Study of Refactoring Motivations in Open-Source Projects
Abstract:
Context. Code refactoring improves software quality without changing external behavior. Despite its advantages, its benefits are hindered by the considerable cost of time, resources, and continuous effort it demands. Aim. Understanding why developers refactor, and which metrics capture these motivations, may support wider and more effective use of refactoring in practice. Method. We performed a large-scale empirical study to analyze developers refactoring activity, leveraging Large Language Models (LLMs) to identify underlying motivations from version control data, comparing our findings with previous motivations reported in the literature. Results. LLMs matched human judgment in 80% of cases, but aligned with literature-based motivations in only 47%. They enriched 22% of motivations with more detailed rationale, often highlighting readability, clarity, and structural improvements. Most motivations were pragmatic, focused on simplification and maintainability. While metrics related to developer experience and code readability ranked highest, their correlation with motivation categories was weak. Conclusions. We conclude that LLMs effectively capture surface-level motivations but struggle with architectural reasoning. Their value lies in providing localized explanations, which, when combined with software metrics, can form hybrid approaches. Such integration offers a promising path toward prioritizing refactoring more systematically and balancing short-term improvements with long-term architectural goals.
中文摘要:研究表明,大型语言模型能有效识别开发者重构代码的表面动机(如可读性),但在架构推理方面存在不足,结合软件指标的混合方法有望更系统地平衡短期改进与长期架构目标。
English Summary: This study demonstrates that Large Language Models (LLMs) can effectively identify developers' surface-level refactoring motivations like code readability, though they struggle with architectural reasoning, suggesting hybrid approaches combining LLMs with software metrics could better prioritize refactoring efforts.

Authors:Xiaomeng Hu, Fei Huang, Chenhan Yuan, Junyang Lin, Tsung-Yi Ho
Title: CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention
Abstract:
As large language models (LLMs) are increasingly deployed in real-world applications, ensuring the safety of their outputs during decoding has become a critical challenge. However, existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. In this work, we propose CARE, a novel framework for decoding-time safety alignment that integrates three key components: (1) a guard model for real-time safety monitoring, enabling detection of potentially unsafe content; (2) a rollback mechanism with a token buffer to correct unsafe outputs efficiently at an earlier stage without disrupting the user experience; and (3) a novel introspection-based intervention strategy, where the model generates self-reflective critiques of its previous outputs and incorporates these reflections into the context to guide subsequent decoding steps. The framework achieves a superior safety-quality trade-off by using its guard model for precise interventions, its rollback mechanism for timely corrections, and our novel introspection method for effective self-correction. Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.
Chinese: CARE框架通过整合实时监控、回滚机制和基于自省的自我修正,在解码过程中提升大语言模型的安全性,实现了安全性和响应质量的最佳平衡。
English: The CARE framework enhances LLM safety during decoding by integrating real-time monitoring, a rollback mechanism, and introspection-based self-correction to achieve an optimal balance between safety and response quality.

Authors:Jian Wu, Hang Yu, Bingchang Liu, Wenjie Yang, Peng Di, Jianguo Li, Yue Zhang
Title: LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection
Abstract:
Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization process. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency compared to all evaluated baselines.
中文: LAMDAS提出了一种创新方法,利用大语言模型自身作为隐式分类器进行高效的领域特定数据选择,在性能和计算效率上均优于全数据训练及现有先进基线。
English: LAMDAS introduces a novel method using the LLM itself as an implicit classifier for efficient domain-specific data selection, outperforming full-data training and state-of-the-art baselines with superior computational efficiency.

Authors:Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, Dong Qiu
Title: Agentic Software Engineering: Foundational Pillars and a Research Roadmap
Abstract:
Agentic Software Engineering (SE 3.0) represents a new era where intelligent agents are tasked not with simple code generation, but with achieving complex, goal-oriented SE objectives. To harness these new capabilities while ensuring trustworthiness, we must recognize a fundamental duality within the SE field in the Agentic SE era, comprising two symbiotic modalities: SE for Humans and SE for Agents. This duality demands a radical reimagining of the foundational pillars of SE (actors, processes, tools, and artifacts) which manifest differently across each modality. We propose two purpose-built workbenches to support this vision. The Agent Command Environment (ACE) serves as a command center where humans orchestrate and mentor agent teams, handling outputs such as Merge-Readiness Packs (MRPs) and Consultation Request Packs (CRPs). The Agent Execution Environment (AEE) is a digital workspace where agents perform tasks while invoking human expertise when facing ambiguity or complex trade-offs. This bi-directional partnership, which supports agent-initiated human callbacks and handovers, gives rise to new, structured engineering activities (i.e., processes) that redefine human-AI collaboration, elevating the practice from agentic coding to true agentic software engineering. This paper presents the Structured Agentic Software Engineering (SASE) vision, outlining several of the foundational pillars for the future of SE. The paper culminates in a research roadmap that identifies a few key challenges and opportunities while briefly discussing the resulting impact of this future on SE education. Our goal is not to offer a definitive solution, but to provide a conceptual scaffold with structured vocabulary to catalyze a community-wide dialogue, pushing the SE community to think beyond its classic, human-centric tenets toward a disciplined, scalable, and trustworthy agentic future.
中文摘要:代理软件工程(SE 3.0)通过人类主导的代理指挥环境(ACE)与代理执行环境(AEE)的双工作台架构,构建了结构化人机协作新模式,将软件工程从基础编码提升至面向复杂目标的系统化工程实践。
English Summary: Agentic Software Engineering (SE 3.0) introduces a dual-modality framework where humans orchestrate agents through specialized workbenches (ACE and AEE), transforming software engineering into structured human-AI collaboration that moves beyond simple coding to achieve complex objectives.

Authors:Zhilin Wang, Zhe Yang, Yun Luo, Yafu Li, Haoran Zhang, Runzhe Zhan, Derek F. Wong, Jizhe Zhou, Yu Cheng
Title: Synthesizing Sheet Music Problems for Evaluation and Reinforcement Learning
Abstract:
Enhancing the ability of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) to interpret sheet music is a crucial step toward building AI musicians. However, current research lacks both evaluation benchmarks and training data for sheet music reasoning. To address this, we propose the idea of synthesizing sheet music problems grounded in music theory, which can serve both as evaluation benchmarks and as training data for reinforcement learning with verifiable rewards (RLVR). We introduce a data synthesis framework that generates verifiable sheet music questions in both textual and visual modalities, leading to the Synthetic Sheet Music Reasoning Benchmark (SSMR-Bench) and a complementary training set. Evaluation results on SSMR-Bench show the importance of models' reasoning abilities in interpreting sheet music. At the same time, the poor performance of Gemini 2.5-Pro highlights the challenges that MLLMs still face in interpreting sheet music in a visual format. By leveraging synthetic data for RLVR, Qwen3-8B-Base and Qwen2.5-VL-Instruct achieve improvements on the SSMR-Bench. Besides, the trained Qwen3-8B-Base surpasses GPT-4 in overall performance on MusicTheoryBench and achieves reasoning performance comparable to GPT-4 with the strategies of Role play and Chain-of-Thought. Notably, its performance on math problems also improves relative to the original Qwen3-8B-Base. Furthermore, our results show that the enhanced reasoning ability can also facilitate music composition. In conclusion, we are the first to propose the idea of synthesizing sheet music problems based on music theory rules, and demonstrate its effectiveness not only in advancing model reasoning for sheet music understanding but also in unlocking new possibilities for AI-assisted music creation.
中文: 本研究提出一种基于音乐理论合成可验证乐谱问题的框架,有效提升了AI模型在乐谱理解和音乐创作方面的推理能力。
English: This study introduces a synthetic data framework for generating verifiable sheet music problems to enhance AI models' reasoning in music interpretation, demonstrating improved performance in benchmarks and music composition.

Authors:Kaili sun, Xingyu Miao, Bing Zhai, Haoran Duan, Yang Long
Title: Decoding Visual Neural Representations by Multimodal with Dynamic Balancing
Abstract:
In this work, we propose an innovative framework that integrates EEG, image, and text data, aiming to decode visual neural representations from low signal-to-noise ratio EEG signals. Specifically, we introduce text modality to enhance the semantic correspondence between EEG signals and visual content. With the explicit semantic labels provided by text, image and EEG features of the same category can be more closely aligned with the corresponding text representations in a shared multimodal space. To fully utilize pre-trained visual and textual representations, we propose an adapter module that alleviates the instability of high-dimensional representation while facilitating the alignment and fusion of cross-modal features. Additionally, to alleviate the imbalance in multimodal feature contributions introduced by the textual representations, we propose a Modal Consistency Dynamic Balance (MCDB) strategy that dynamically adjusts the contribution weights of each modality. We further propose a stochastic perturbation regularization (SPR) term to enhance the generalization ability of semantic perturbation-based models by introducing dynamic Gaussian noise in the modality optimization process. The evaluation results on the ThingsEEG dataset show that our method surpasses previous state-of-the-art methods in both Top-1 and Top-5 accuracy metrics, improving by 2.0\% and 4.7\% respectively.
Chinese: 本研究提出了一种创新的多模态框架,通过整合脑电图、图像和文本数据,从低信噪比的脑电信号中解码视觉神经表征,并在ThingsEEG数据集上以显著提升的准确率超越了现有最优方法。
English: This study introduces a novel multimodal framework that integrates EEG, image, and text data to decode visual neural representations from noisy EEG signals, achieving state-of-the-art performance with significant accuracy improvements on the ThingsEEG dataset.

Authors:Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhongzhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Michael Littman, Jun Wang, Shuicheng Yan, Philip Torr, Lei Bai
Title: The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Abstract:
The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM-RL with the temporally extended, partially observable Markov decision processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.
中文: 智能体强化学习将大语言模型从被动序列生成器转变为复杂环境中的自主决策体,通过强化学习将规划、工具使用等核心能力转化为自适应行为,并构建了分类体系与研究资源以推动领域发展。
English: Agentic reinforcement learning transforms large language models from passive generators into autonomous agents operating in complex environments, using reinforcement learning to develop adaptive capabilities like planning and tool use while establishing a taxonomy and resources for future research.

Authors:Enzhi Zhou, Yue Xiao, Ziyue Liu, Sotiris A. Tegos, Panagiotis D. Diamantoulakis, George K. Karagiannidis
Title: Beamforming Design for Pinching Antenna Systems with Multiple Receive Antennas
Abstract:
Next-generation networks require intelligent and robust channel conditions to support ultra-high data rates, seamless connectivity, and large-scale device deployments in dynamic environments. While flexible antenna technologies such as fluid and movable antennas offer some degree of adaptability, their limited reconfiguration range and structural rigidity reduce their effectiveness in restoring line-of-sight (LoS) links. As a complementary solution, pinching antenna systems (PASs) enable fine-grained, hardware-free control of radiation locations along a waveguide, offering enhanced flexibility in challenging propagation environments, especially under non-LoS (NLoS) conditions. This paper introduces a general and novel modeling framework for downlink PASs targeting users equipped with multiple receive antennas, addressing a practical yet underexplored scenario in the existing literature. Specifically, we first derive an analytical relationship between the received signal-to-noise ratio and the pinching antenna (PA) positions, and based on this, we propose a two-layer placement strategy. First, we optimize the central radiation point using large-scale channel characteristics, and then we use a heuristic compressed placement algorithm to approximate phase alignment across multiple receive antennas and select a spatially compact set of active elements. Simulation results demonstrate notable performance gains over conventional single-antenna schemes, particularly in short-range scenarios with dense PAs and widely spaced user antennas.
下一代网络需要智能信道管理以支持高速率和连接性,而捏合天线系统通过无硬件的灵活控制提升了非视距环境下的性能,本文提出的新型建模框架和布局策略在仿真中显著优于传统单天线方案。
Next-generation networks need intelligent channel management for high data rates and connectivity, with pinching antenna systems offering flexible, hardware-free control to enhance performance in non-line-of-sight conditions, as demonstrated by a novel modeling framework and placement strategy that significantly outperforms traditional single-antenna approaches.

Authors:Yongqi Jin, Jun-Jie Wang, Fanjie Xu, Xiaohong Ji, Zhifeng Gao, Linfeng Zhang, Guolin Ke, Rong Zhu, Weinan E
Title: NMR-Solver: Automated Structure Elucidation via Large-Scale Spectral Matching and Physics-Guided Fragment Optimization
Abstract:
Nuclear Magnetic Resonance (NMR) spectroscopy is one of the most powerful and widely used tools for molecular structure elucidation in organic chemistry. However, the interpretation of NMR spectra to determine unknown molecular structures remains a labor-intensive and expertise-dependent process, particularly for complex or novel compounds. Although recent methods have been proposed for molecular structure elucidation, they often underperform in real-world applications due to inherent algorithmic limitations and limited high-quality data. Here, we present NMR-Solver, a practical and interpretable framework for the automated determination of small organic molecule structures from $^1$H and $^{13}$C NMR spectra. Our method introduces an automated framework for molecular structure elucidation, integrating large-scale spectral matching with physics-guided fragment-based optimization that exploits atomic-level structure-spectrum relationships in NMR. We evaluate NMR-Solver on simulated benchmarks, curated experimental data from the literature, and real-world experiments, demonstrating its strong generalization, robustness, and practical utility in challenging, real-life scenarios. NMR-Solver unifies computational NMR analysis, deep learning, and interpretable chemical reasoning into a coherent system. By incorporating the physical principles of NMR into molecular optimization, it enables scalable, automated, and chemically meaningful molecular identification, establishing a generalizable paradigm for solving inverse problems in molecular science.
中文: NMR-Solver是一种自动化框架,通过结合谱图匹配和物理引导的优化方法,能根据核磁共振谱精确解析小分子有机化合物结构,在实际应用中表现出卓越性能。
English: NMR-Solver is an automated framework that integrates spectral matching and physics-guided optimization to accurately determine small organic molecule structures from NMR spectra, demonstrating strong performance in real-world applications.

Authors:Hao Zhou, Yiyan Ma, Dan Fei, Weirong Liu, Zhengyu Zhang, Mi Yang, Guoyu Ma, Yunlong Lu, Ruisi He, Guoyu Wang, Cheng Li, Zhaohui Song, Bo Ai
Title: Delay-Doppler Domain Channel Measurements and Modeling in High-Speed Railways
Abstract:
As next-generation wireless communication systems need to be able to operate in high-frequency bands and high-mobility scenarios, delay-Doppler (DD) domain multicarrier (DDMC) modulation schemes, such as orthogonal time frequency space (OTFS), demonstrate superior reliability over orthogonal frequency division multiplexing (OFDM). Accurate DD domain channel modeling is essential for DDMC system design. However, since traditional channel modeling approaches are mainly confined to time, frequency, and space domains, the principles of DD domain channel modeling remain poorly studied. To address this issue, we propose a systematic DD domain channel measurement and modeling methodology in high-speed railway (HSR) scenarios. First, we design a DD domain channel measurement method based on the long-term evolution for railway (LTE-R) system. Second, for DD domain channel modeling, we investigate quasi-stationary interval, statistical power modeling of multipath components, and particularly, the quasi-invariant intervals of DD domain channel fading coefficients. Third, via LTE-R measurements at 371 km/h, taking the quasi-stationary interval as the decision criterion, we establish DD domain channel models under different channel time-varying conditions in HSR scenarios. Fourth, the accuracy of proposed DD domain channel models is validated via bit error rate comparison of OTFS transmission. In addition, simulation verifies that in HSR scenario, the quasi-invariant interval of DD domain channel fading coefficient is on millisecond (ms) order of magnitude, which is much smaller than the quasi-stationary interval length on $100$ ms order of magnitude. This study could provide theoretical guidance for DD domain modeling in high-mobility environments, supporting future DDMC and integrated sensing and communication designs for 6G and beyond.
中文: 本研究针对高速铁路场景提出了一种系统的时延-多普勒域信道建模方法,通过LTE-R测量验证了模型准确性,并为未来6G移动通信系统的设计提供了理论支撑。
English: The study introduces a systematic delay-Doppler domain channel modeling approach for high-speed railway scenarios, demonstrating its effectiveness through LTE-R measurements and validating the model's accuracy via OTFS transmission performance, offering guidance for future 6G communication systems.

Authors:Zhenghao Zhang, Ziying Zhang, Junchao Liao, Xiangyu Meng, Qiang Hu, Siyu Zhu, Xiaoyun Zhang, Long Qin, Weizhi Wang
Title: LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face Editing
Abstract:
Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapping positional encoding that integrates facial and image tokens for unified processing, enabling flexible yet decoupled geometry-appearance interactions with high efficiency and strong identity preservation; and (3) a landmark predictor that leverages vision-language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency. Code and dataset will be made publicly available upon acceptance.
中文: LaTo提出了一种基于地标标记化的扩散变换器,通过将地标量化为离散标记并与图像标记融合,实现了精细化的属性控制和身份保持,在身份保持和语义一致性方面显著优于现有方法。
English: LaTo introduces a landmark-tokenized diffusion transformer that enhances precise attribute control and identity preservation in face editing by quantizing landmarks into discrete tokens and integrating them with image tokens, outperforming existing methods in both identity preservation and semantic consistency.

Authors:Takuya Fujimura, Kota Dohi, Natsuo Yamashita, Yohei Kawaguchi
Title: Can VLM Pseudo-Labels Train a Time-Series QA Model That Outperforms the VLM?
Abstract:
Time-series question answering (TSQA) tasks face significant challenges due to the lack of labeled data. Alternatively, with recent advancements in large-scale models, vision-language models (VLMs) have demonstrated the potential to analyze time-series signals in a zero-shot manner. In this paper, we propose a training approach that uses pseudo labels generated by a VLM. Although VLMs can produce incorrect labels, TSQA models can still be effectively trained based on the property that deep neural networks are inherently robust to such noisy labels. Our experimental results demonstrate that TSQA models are not only successfully trained with pseudo labels, but also surpass the performance of the VLM itself by leveraging a large amount of unlabeled data.
中文: 本文提出了一种利用视觉语言模型生成伪标签来训练时序问答模型的方法,通过深度神经网络对噪声标签的鲁棒性,成功利用大量未标注数据使模型性能超越原视觉语言模型。
English: This paper introduces a training method for time-series question answering models using pseudo labels from vision-language models, effectively leveraging their robustness to noisy labels and outperforming the original models with abundant unlabeled data.

Authors:Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh
Title: Polychromic Objectives for Reinforcement Learning
Abstract:
Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the advantage function to reflect the advantage under our new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity show that our method improves success rates by reliably solving a larger set of environment configurations and generalizes better under large perturbations. Moreover, when given multiple attempts in pass@$k$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.
Reinforcement learning fine-tuning often causes policy collapse, so we introduce a polychromic objective to maintain diversity and adapt PPO with vine sampling and modified advantages, improving success rates and generalization across tasks.
English Summary:

Authors:Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang
Title: VGGT-X: When VGGT Meets Dense Novel View Synthesis
Abstract:
We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/
Chinese: 本研究通过引入VGGT-X系统,解决了三维基础模型在密集新视角合成中的内存限制和输出缺陷问题,实现了不依赖传统COLMAP流程的最先进性能。
English: This research addresses the challenges of scaling 3D Foundation Models for dense Novel View Synthesis by introducing VGGT-X, which overcomes memory limitations and output imperfections to achieve state-of-the-art results without relying on traditional COLMAP pipelines.

Authors:Chuanyang Jin, Jing Xu, Bo Liu, Leitian Tao, Olga Golovneva, Tianmin Shu, Wenting Zhao, Xian Li, Jason Weston
Title: The Era of Real-World Human Interaction: RL from User Conversations
Abstract:
We posit that to achieve continual model improvement and multifaceted alignment, future models must learn from natural human interaction. Current conversational models are aligned using pre-annotated, expert-generated human feedback. In this work, we introduce Reinforcement Learning from Human Interaction (RLHI), a paradigm that learns directly from in-the-wild user conversations. We develop two complementary methods: (1) RLHI with User-Guided Rewrites, which revises unsatisfactory model outputs based on users' natural-language follow-up responses, (2) RLHI with User-Based Rewards, which learns via a reward model conditioned on knowledge of the user's long-term interaction history (termed persona). Together, these methods link long-term user personas to turn-level preferences via persona-conditioned preference optimization. Trained on conversations derived from WildChat, both RLHI variants outperform strong baselines in personalization and instruction-following, and similar feedback enhances performance on reasoning benchmarks. These results suggest organic human interaction offers scalable, effective supervision for personalized alignment.
中文摘要:未来的模型应通过自然的人类互动持续改进,本文提出的基于真实对话的人类交互强化学习(RLHI)方法在个性化和指令遵循方面优于现有基线。
English Summary: Future models should learn from natural human interactions to improve continuously, and the introduced Reinforcement Learning from Human Interaction (RLHI) methods, trained on real conversations, outperform baselines in personalization and instruction-following.

Authors:Sogol Masoumzadeh, Keheliya Gallaba, Dayi Lin, Ahmed E. Hassan
Title: Towards Reliable Generation of Executable Workflows by Foundation Models
Abstract:
Recent advancements in Foundation Models (FMs) have demonstrated significant progress in comprehending complex natural language to perform intricate tasks. Successfully executing these tasks often requires orchestrating calls to FMs alongside other software components. However, manually decomposing a task into a coherent sequence of smaller, logically aggregated steps, commonly referred to as workflows, demands considerable effort and specialized domain knowledge. While FMs can assist in generating such workflows specified in domain-specific languages (DSLs), achieving accuracy and reliability in this process remains a challenge. This work introduces a framework that leverages static analysis feedback to enable FMs to detect and repair defects in the DSL-based workflows they generate. We begin by presenting the first-ever taxonomy of incidences of defects in FM-generated DSL workflows, categorizing them into 18 distinct types. Furthermore, we observe a high prevalence of defects across FM-generated DSL workflows, with 87.27% of the studied instances containing at least one defect. This, in turn, emphasizes the magnitude of the problem in practice and underscores the necessity for implementing mitigation strategies. Following this, we demonstrate that nine types of these defects can be effectively identified through static analysis of the workflows. For this purpose, we develop Timon, the first-of-its-kind static analyzer specifically designed for FM-generated DSL workflows. Finally, we show that by incorporating feedback from Timon, we can guide Pumbaa, an FM-based tool, to repair the detected defect incidences. By systematically detecting and repairing defects, our work provides a crucial step towards the reliable and automated generation of executable workflows from natural language requirements.
中文:本研究提出一种利用静态分析反馈的框架,帮助基础模型检测和修复领域特定语言工作流中的缺陷,针对缺陷高发问题推进了从自然语言需求生成可靠自动化工作流的进程。
English: This study introduces a framework that uses static analysis feedback to help Foundation Models detect and repair defects in domain-specific language workflows, addressing the high prevalence of defects and advancing reliable automated workflow generation from natural language.

Authors:Yutong Hao, Chen Chen, Ajmal Saeed Mian, Chang Xu, Daochang Liu
Title: Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
Abstract:
Diffusion models can generate realistic videos, but existing methods rely on implicitly learning physical reasoning from large-scale text-video datasets, which is costly, difficult to scale, and still prone to producing implausible motions that violate fundamental physical laws. We introduce a training-free framework that improves physical plausibility at inference time by explicitly reasoning about implausibility and guiding the generation away from it. Specifically, we employ a lightweight physics-aware reasoning pipeline to construct counterfactual prompts that deliberately encode physics-violating behaviors. Then, we propose a novel Synchronized Decoupled Guidance (SDG) strategy, which leverages these prompts through synchronized directional normalization to counteract lagged suppression and trajectory-decoupled denoising to mitigate cumulative trajectory bias, ensuring that implausible content is suppressed immediately and consistently throughout denoising. Experiments across different physical domains show that our approach substantially enhances physical fidelity while maintaining photorealism, despite requiring no additional training. Ablation studies confirm the complementary effectiveness of both the physics-aware reasoning component and SDG. In particular, the aforementioned two designs of SDG are also individually validated to contribute critically to the suppression of implausible content and the overall gains in physical plausibility. This establishes a new and plug-and-play physics-aware paradigm for video generation.
Chinese Summary: 本文提出了一种无需训练的框架,通过反事实提示显式识别违反物理规律的行为,并采用同步解耦引导策略在去噪过程中抑制不合理内容,从而显著提升视频生成的物理合理性。
English Summary: The paper presents a training-free framework that enhances physical plausibility in video generation by explicitly identifying physics-violating behaviors through counterfactual prompts and employing a novel Synchronized Decoupled Guidance strategy to suppress implausible content during denoising.

Authors:Zhihao Wang, Jianxiong Li, Jinliang Zheng, Wencong Zhang, Dongxiu Liu, Yinan Zheng, Haoyi Niu, Junzhi Yu, Xianyuan Zhan
Title: PhysiAgent: An Embodied Agent Framework in Physical World
Abstract:
Vision-Language-Action (VLA) models have achieved notable success but often struggle with limited generalizations. To address this, integrating generalized Vision-Language Models (VLMs) as assistants to VLAs has emerged as a popular solution. However, current approaches often combine these models in rigid, sequential structures: using VLMs primarily for high-level scene understanding and task planning, and VLAs merely as executors of lower-level actions, leading to ineffective collaboration and poor grounding challenges. In this paper, we propose an embodied agent framework, PhysiAgent, tailored to operate effectively in physical environments. By incorporating monitor, memory, self-reflection mechanisms, and lightweight off-the-shelf toolboxes, PhysiAgent offers an autonomous scaffolding framework to prompt VLMs to organize different components based on real-time proficiency feedback from VLAs to maximally exploit VLAs' capabilities. Experimental results demonstrate significant improvements in task-solving performance on complex real-world robotic tasks, showcasing effective self-regulation of VLMs, coherent tool collaboration, and adaptive evolution of the framework during execution. PhysiAgent makes practical and pioneering efforts to integrate VLMs and VLAs, effectively grounding embodied agent frameworks in real-world settings.
中文摘要:PhysiAgent框架通过整合视觉语言模型与实时反馈机制,有效提升了视觉-语言-行动模型在物理环境中的任务执行能力和自适应协作水平。
English Summary: The PhysiAgent framework enhances Vision-Language-Action models by integrating Vision-Language Models with real-time feedback mechanisms, significantly improving task performance and adaptive collaboration in physical environments.

Authors:Zhaomin Wu, Haodong Zhao, Ziyang Wang, Jizhou Guo, Qian Wang, Bingsheng He
Title: LLM DNA: Tracing Model Evolution via Functional Representations
Abstract:
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism properties and establish the existence of DNA. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on specific tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.
中文: 该研究提出LLM DNA作为大语言模型功能行为的数学表征,通过可扩展、免训练的提取与比较方法,揭示了模型间未记录的演化关系并构建出精确的系统发育树。
English: The study introduces LLM DNA, a mathematical representation of large language models' functional behavior, enabling scalable, training-free extraction and comparison that reveals undocumented evolutionary relationships and constructs accurate phylogenetic trees.

Authors:Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Title: Muon: Training and Trade-offs with Latent Attention and MoE
Abstract:
We present a comprehensive theoretical and empirical study of the Muon optimizer for training transformers only with a small to medium decoder (30M - 200M parameters), with an emphasis on its mathematical foundations, convergence properties and synergistic interactions with modern architectural optimizations. Building on recent work showing Muon's scalability, we provide rigorous theoretical analysis including: (i)showing the convergence rate under standard assumptions, (ii) spectral regularization properties that prevent gradient explosion, (iii) connection to natural gradient descent on the Stiefel manifold, and (iv) equivalence to steepest gradient descent under the spectral norm. Crucially, we demonstrate that Muon expands the Pareto frontier in the compute-time trade-off by maintaining superior data efficiency at large batch sizes, a key finding of~\cite{essentialai2025muon} that we validate across our model scales. Empirically, Muon reaches the target loss with 48-52\% of the training calculated by AdamW while maintaining or improving the final perplexity, consistent with larger-scale results. When combined with Multi-Head Latent Attention (MLA) and Mixture-of-Experts (MoE), we observe multiplicative efficiency gains: MLA+MoE+Muon achieves 68\% memory reduction and 3.2$\times$ inference speedup, while improving perplexity by 8-12\%. We provide detailed procedures on 15 architectural and optimizer components, stability analyzes across 100+ training runs, and practical implementation guidelines including Newton-Schulz coefficients $(3.4445, -4.7750, 2.0315)$ optimized by~\cite{su2024muonblog}. Our theoretical analysis and comprehensive experiments establish Muon as a principled, robust alternative to AdamW that particularly excels when combined with modern efficiency techniques and large-batch training regimes.
中文: Muon优化器在训练中小型解码器时展现出卓越的数据效率和收敛速度,相比AdamW减少近半训练计算量且保持或提升性能,尤其与MLA和MoE等现代架构优化技术结合时,能实现内存降低68%和推理加速3.2倍的倍增效益。
English: The Muon optimizer demonstrates superior data efficiency and faster convergence than AdamW, reducing training computation by nearly half while maintaining or improving model performance, especially when combined with modern architectural optimizations like MLA and MoE for multiplicative gains in memory and speed.

Authors:Amira Guesmi, Muhammad Shafique
Title: DRIFT: Divergent Response in Filtered Transformations for Robust Adversarial Defense
Abstract:
Deep neural networks remain highly vulnerable to adversarial examples, and most defenses collapse once gradients can be reliably estimated. We identify \emph{gradient consensus} -- the tendency of randomized transformations to yield aligned gradients -- as a key driver of adversarial transferability. Attackers exploit this consensus to construct perturbations that remain effective across transformations. We introduce \textbf{DRIFT} (Divergent Response in Filtered Transformations), a stochastic ensemble of lightweight, learnable filters trained to actively disrupt gradient consensus. Unlike prior randomized defenses that rely on gradient masking, DRIFT enforces \emph{gradient dissonance} by maximizing divergence in Jacobian- and logit-space responses while preserving natural predictions. Our contributions are threefold: (i) we formalize gradient consensus and provide a theoretical analysis linking consensus to transferability; (ii) we propose a consensus-divergence training strategy combining prediction consistency, Jacobian separation, logit-space separation, and adversarial robustness; and (iii) we show that DRIFT achieves substantial robustness gains on ImageNet across CNNs and Vision Transformers, outperforming state-of-the-art preprocessing, adversarial training, and diffusion-based defenses under adaptive white-box, transfer-based, and gradient-free attacks. DRIFT delivers these improvements with negligible runtime and memory cost, establishing gradient divergence as a practical and generalizable principle for adversarial defense.
中文: 深度神经网络因梯度共识易受对抗性攻击,而DRIFT通过随机滤波器集成强制梯度异响,在保持预测准确性的同时显著提升了模型在多种攻击下的鲁棒性,且计算成本极低。
English: Deep neural networks are vulnerable to adversarial attacks due to gradient consensus, which DRIFT counters by using a stochastic ensemble of filters to enforce gradient dissonance, significantly improving robustness across various models and attack types with minimal computational cost.

Authors:Satyanarayana Raju G. V., Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Title: A study of Universal ODE approaches to predicting soil organic carbon
Abstract:
Soil Organic Carbon (SOC) is a foundation of soil health and global climate resilience, yet its prediction remains difficult because of intricate physical, chemical, and biological processes. In this study, we explore a Scientific Machine Learning (SciML) framework built on Universal Differential Equations (UDEs) to forecast SOC dynamics across soil depth and time. UDEs blend mechanistic physics, such as advection diffusion transport, with neural networks that learn nonlinear microbial production and respiration. Using synthetic datasets, we systematically evaluated six experimental cases, progressing from clean, noise free benchmarks to stress tests with high (35%) multiplicative, spatially correlated noise. Our results highlight both the potential and limitations of the approach. In noise free and moderate noise settings, the UDE accurately reconstructed SOC dynamics. In clean terminal profile at 50 years (Case 4) achieved near perfect fidelity, with MSE = 1.6e-5, and R2 = 0.9999. Case 5, with 7% noise, remained robust (MSE = 3.4e-6, R2 = 0.99998), capturing depth wise SOC trends while tolerating realistic measurement uncertainty. In contrast, Case 3 (35% noise at t = 0) showed clear evidence of overfitting: the model reproduced noisy inputs with high accuracy but lost generalization against the clean truth (R2 = 0.94). Case 6 (35% noise at t = 50) collapsed toward overly smooth mean profiles, failing to capture depth wise variability and yielding negative R2, underscoring the limits of standard training under severe uncertainty. These findings suggest that UDEs are well suited for scalable, noise tolerant SOC forecasting, though advancing toward field deployment will require noise aware loss functions, probabilistic modelling, and tighter integration of microbial dynamics.
中文: 本研究采用基于通用微分方程的科学机器学习框架预测土壤有机碳动态,在无噪声和中等噪声条件下表现出高精度,但在高噪声环境下显示出过拟合和泛化能力下降等局限性。
English: This study employs a Scientific Machine Learning framework based on Universal Differential Equations to predict soil organic carbon dynamics, demonstrating high accuracy in noise-free and moderate-noise conditions but revealing limitations like overfitting and loss of generalization under high noise levels.

Authors:Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, Ting Wang
Title: Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models
Abstract:
Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. However, recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. To test this hypothesis, we conduct controlled experiments that challenge LRMs with misleading cues during reasoning and/or corrupted answers during retrieval. Our results across models and datasets confirm that both mechanisms operate simultaneously, with their relative dominance influenced by multiple factors: problem domains, model scales, and fine-tuning approaches (e.g., reinforcement learning vs. distillation). The findings reveal a critical limitation in current reasoning fine-tuning paradigms: models can exploit the retrieval mechanism as a shortcut, effectively "hacking" the reward signal and undermining genuine reasoning development. To address this challenge, we introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning. By carefully suppressing retrieval shortcuts during the fine-tuning process, FARL promotes reasoning-dominant behavior and enhances generalizable reasoning capabilities.
中文摘要:大型推理模型存在思维链推理与记忆检索机制之间的冲突,而提出的FARL框架通过抑制检索捷径来增强真正的推理能力。
English Summary: Large reasoning models exhibit a conflict between chain-of-thought reasoning and memory retrieval mechanisms, which the proposed FARL framework addresses by suppressing retrieval shortcuts to enhance genuine reasoning capabilities.

Authors:Ken Deng, Zizheng Zhan, Wen Xiang, Wenqiang Zhu, Tianhao Peng, Xinping Lei, Weihao Li, Jingxuan Xu, Kun Wu, Yifan Yao, Haoyang Huang, Huaixi Tang, Kepeng Lei, Zhiyi Lai, Songwei Yu, Zongxian Feng, Zuchen Gao, Weihao Xie, Chenchen Zhang, Yanan Wu, Yuanxing Zhang, Lecheng Huang, Yuqun Zhang, Jie Liu, Zhaoxiang Zhang, Haotian Zhang, Bin Chen, Jiaheng Liu
Title: HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs
Abstract:
Large Language Models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to improve accuracy on complex tasks. However, always generating lengthy reasoning traces is inefficient, leading to excessive token usage and higher inference costs. This paper introduces the Hybrid Policy Optimization (i.e., HiPO), a framework for adaptive reasoning control that enables LLMs to selectively decide when to engage in detailed reasoning (Think-on) and when to respond directly (Think-off). Specifically, HiPO combines a hybrid data pipelineproviding paired Think-on and Think-off responseswith a hybrid reinforcement learning reward system that balances accuracy and efficiency while avoiding over-reliance on detailed reasoning. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy. Finally, we hope HiPO a can be a principled approach for efficient adaptive reasoning, advancing the deployment of reasoning-oriented LLMs in real-world, resource-sensitive settings.
中文: 本文提出的HiPO框架让大语言模型能自适应选择详细推理或直接回应,在数学和编程任务中显著减少标记使用量的同时保持或提升准确率。
English: This paper introduces HiPO, a framework that enables Large Language Models to adaptively choose between detailed reasoning and direct responses, significantly reducing token usage while maintaining or improving accuracy across mathematical and coding tasks.

Authors:Ky Dan Nguyen, Hoang Lam Tran, Anh-Dung Dinh, Daochang Liu, Weidong Cai, Xiuying Wang, Chang Xu
Title: Not All Tokens are Guided Equal: Improving Guidance in Visual Autoregressive Models
Abstract:
Autoregressive (AR) models based on next-scale prediction are rapidly emerging as a powerful tool for image generation, but they face a critical weakness: information inconsistencies between patches across timesteps introduced by progressive resolution scaling. These inconsistencies scatter guidance signals, causing them to drift away from conditioning information and leaving behind ambiguous, unfaithful features. We tackle this challenge with Information-Grounding Guidance (IGG), a novel mechanism that anchors guidance to semantically important regions through attention. By adaptively reinforcing informative patches during sampling, IGG ensures that guidance and content remain tightly aligned. Across both class-conditioned and text-to-image generation tasks, IGG delivers sharper, more coherent, and semantically grounded images, setting a new benchmark for AR-based methods.
中文摘要:本文提出信息锚定引导(IGG)机制,通过注意力将引导信号固定于语义关键区域,有效解决自回归图像生成中渐进分辨率缩放引发的信息不一致问题,在各类生成任务中实现了更清晰、连贯且语义准确的图像。
English Summary: The paper introduces Information-Grounding Guidance (IGG), a novel mechanism that addresses information inconsistencies in autoregressive image generation models by anchoring guidance to semantically important regions through attention, resulting in sharper and more coherent images across various tasks.

Authors:Yiheng Huang, Junran Peng, Silei Shen, Jingwei Yang, ZeJi Wei, ChenCheng Bai, Yonghao He, Wei Sui, Muyi Sun, Yan Liu, Xu-Cheng Yin, Man Zhang, Zhaoxiang Zhang, Chuanchen Luo
Title: SIG-Chat: Spatial Intent-Guided Conversational Gesture Generation Involving How, When and Where
Abstract:
The accompanying actions and gestures in dialogue are often closely linked to interactions with the environment, such as looking toward the interlocutor or using gestures to point to the described target at appropriate moments. Speech and semantics guide the production of gestures by determining their timing (WHEN) and style (HOW), while the spatial locations of interactive objects dictate their directional execution (WHERE). Existing approaches either rely solely on descriptive language to generate motions or utilize audio to produce non-interactive gestures, thereby lacking the characterization of interactive timing and spatial intent. This significantly limits the applicability of conversational gesture generation, whether in robotics or in the fields of game and animation production. To address this gap, we present a full-stack solution. We first established a unique data collection method to simultaneously capture high-precision human motion and spatial intent. We then developed a generation model driven by audio, language, and spatial data, alongside dedicated metrics for evaluating interaction timing and spatial accuracy. Finally, we deployed the solution on a humanoid robot, enabling rich, context-aware physical interactions.
中文摘要:本研究提出了一种全栈解决方案,通过整合音频、语言和空间数据来生成交互式对话手势,解决了现有方法在捕捉交互时机和空间意图方面的不足,并在仿人机器人上成功验证了该方案的实用性。
English Summary: This study introduces a comprehensive solution for generating interactive conversational gestures by integrating audio, language, and spatial data to address the limitations of existing methods in capturing timing and spatial intent, which was validated through deployment on a humanoid robot.

Authors:Taiqiang Wu, Runming Yang, Tao Liu, Jiahao Wang, Zenan Xu, Ngai Wong
Title: Timber: Training-free Instruct Model Refining with Base via Effective Rank
Abstract:
Post-training, which elicits a pretrained Base model into the corresponding Instruct model, is widely considered to be superficial. In this work, we first reinforce this hypothesis by providing novel quantitative evidence from the weight level that the effective rank (eRank) remains negligibly changed. However, this superficiality also suffers a critical trade-off, improving the exploitation capabilities at the cost of limiting its exploration. To tackle this issue, we propose Timber, a simple yet effective training-free method that enhances the exploration capability of the Instruct model while preserving its exploitation. The key insight is to partially revert Instruct towards the paired Base model by subtle yet targeted refinement of the weight deltas. Extensive experiments on Llama and Qwen series demonstrate that Timber consistently improves vanilla Instruct models, particularly on Pass@k performance. Our findings offer new insights into the post-training stage at the weight level and practical strategies to refine the Instruct model without training.
中文: 后训练被证实是表面的,有效秩变化微小,但限制了探索能力;Timber方法通过微调权重增量来增强探索而不损害利用,在Llama和Qwen模型上得到验证。
English: Post-training is shown to be superficial with minimal changes in effective rank, yet it limits exploration, which Timber addresses by refining weight deltas to enhance exploration without compromising exploitation, as validated on Llama and Qwen models.

Authors:Emma Kondrup, Sebastian Sabry, Hussein Abdallah, Zachary Yang, James Zhou, Kellin Pelrine, Jean-François Godbout, Michael M. Bronstein, Reihaneh Rabbany, Shenyang Huang
Title: CrediBench: Building Web-Scale Network Datasets for Information Integrity
Abstract:
Online misinformation poses an escalating threat, amplified by the Internet's open nature and increasingly capable LLMs that generate persuasive yet deceptive content. Existing misinformation detection methods typically focus on either textual content or network structure in isolation, failing to leverage the rich, dynamic interplay between website content and hyperlink relationships that characterizes real-world misinformation ecosystems. We introduce CrediBench: a large-scale data processing pipeline for constructing temporal web graphs that jointly model textual content and hyperlink structure for misinformation detection. Unlike prior work, our approach captures the dynamic evolution of general misinformation domains, including changes in both content and inter-site references over time. Our processed one-month snapshot extracted from the Common Crawl archive in December 2024 contains 45 million nodes and 1 billion edges, representing the largest web graph dataset made publicly available for misinformation research to date. From our experiments on this graph snapshot, we demonstrate the strength of both structural and webpage content signals for learning credibility scores, which measure source reliability. The pipeline and experimentation code are all available here, and the dataset is in this folder.
中文摘要:CrediBench提出了一种大规模时序网络图谱流程,通过联合建模文本内容与超链接结构来动态追踪虚假信息的演变,构建了包含4500万节点和10亿边的最大的公开数据集,证明了结合内容与结构信号能有效提升可信度检测能力。
English Summary: CrediBench introduces a large-scale temporal web graph pipeline that jointly models textual content and hyperlink structure to dynamically track misinformation evolution, creating the largest public dataset with 45M nodes and 1B edges to demonstrate enhanced credibility detection through combined content and structural signals.

Authors:Francesco Marchiori, Rohan Sinha, Christopher Agia, Alexander Robey, George J. Pappas, Mauro Conti, Marco Pavone
Title: Preventing Robotic Jailbreaking via Multimodal Domain Adaptation
Abstract:
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly deployed in robotic environments but remain vulnerable to jailbreaking attacks that bypass safety mechanisms and drive unsafe or physically harmful behaviors in the real world. Data-driven defenses such as jailbreak classifiers show promise, yet they struggle to generalize in domains where specialized datasets are scarce, limiting their effectiveness in robotics and other safety-critical contexts. To address this gap, we introduce J-DAPT, a lightweight framework for multimodal jailbreak detection through attention-based fusion and domain adaptation. J-DAPT integrates textual and visual embeddings to capture both semantic intent and environmental grounding, while aligning general-purpose jailbreak datasets with domain-specific reference data. Evaluations across autonomous driving, maritime robotics, and quadruped navigation show that J-DAPT boosts detection accuracy to nearly 100% with minimal overhead. These results demonstrate that J-DAPT provides a practical defense for securing VLMs in robotic applications. Additional materials are made available at: https://j-dapt.github.io.
中文摘要:J-DAPT是一个轻量级多模态框架,通过注意力融合和领域适配技术整合文本与视觉特征,在自主驾驶等机器人应用中实现接近100%的越狱攻击检测精度。
English Summary: J-DAPT is a lightweight multimodal framework that enhances jailbreak detection in robotic systems by fusing textual and visual embeddings with domain adaptation, achieving near-perfect accuracy across various autonomous applications.

Authors:Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Tomas Pfister, Hamid Palangi
Title: HEART: Emotionally-driven test-time scaling of Language Models
Abstract:
Test-time scaling has shown considerable success in improving the performance of language models on complex reasoning tasks without requiring fine-tuning. However, current strategies such as self-reflection primarily focus on logical or structural refinement. They do not leverage the guiding potential of affective feedback. Inspired by psychological research showing that emotions can modulate cognitive performance, we introduce HEART--a novel framework that uses emotionally-driven prompts for iterative self-correction. HEART provides feedback on a model's incorrect response using a curated set of concise, emotionally charged phrases based on the six universal emotions categorized by Dr. Paul Ekman. By systematically varying the emotional tone of the feedback across iterations, our method guides the model to escape flawed reasoning paths and explore more promising alternatives. We evaluate our framework on challenging reasoning benchmarks including OlympiadBench, Humanity's Last Exam, and SimpleQA. Our results reveal a significant new phenomenon: when guided by an oracle verifier, this affective iteration protocol unlocks significantly deeper reasoning, leading to consistent and substantial increases in accuracy over state-of-the-art baselines with the same verifier. However, we also identify a critical bottleneck for practical deployment. In a verifier-free setting, it struggles to harness these gains consistently, highlighting as a key challenge for future work. Our findings suggest that the next frontier in machine reasoning may lie not just in refining logic, but also in understanding and leveraging the `HEART' of the models.
中文摘要:HEART是一种创新框架,通过情感驱动提示引导语言模型进行迭代自我修正,在配备验证器时显著提升推理准确率,但在无验证器场景下仍面临应用挑战。
English summary: HEART is a novel framework that uses emotionally-driven prompts to guide language models in iterative self-correction, significantly improving reasoning accuracy when paired with an oracle verifier, though it faces challenges in verifier-free settings.

Authors:Kai Zhang, Christopher Malon, Lichao Sun, Martin Renqiang Min
Title: EditGRPO: Reinforcement Learning with Post -Rollout Edits for Clinically Accurate Chest X-Ray Report Generation
Abstract:
Radiology report generation requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. Although recent innovations, particularly multimodal large language models (MLLMs), have shown improved performance, their supervised fine-tuning (SFT) objective is not explicitly aligned with clinical efficacy. In this work, we introduce EditGRPO, a mixed-policy reinforcement learning (RL) algorithm designed specifically to optimize the generation through clinically motivated rewards. EditGRPO integrates on-policy exploration with off-policy guidance by injecting sentence-level detailed corrections during training rollouts. This mixed-policy approach addresses the exploration dilemma and sampling efficiency issues typically encountered in RL. Applied to a Qwen2.5-VL-3B MLLM initialized with supervised fine-tuning (SFT), EditGRPO outperforms both SFT and vanilla GRPO baselines, achieving an average improvement of 3.4% in CheXbert, GREEN, Radgraph, and RATEScore metrics across four major chest X-ray report generation datasets. Notably, EditGRPO also demonstrates superior out-of-domain generalization, with an average performance gain of 5.9% on unseen datasets.
Chinese: EditGRPO是一种混合策略强化学习算法,通过句子级修正优化临床效果来改进放射学报告生成,在性能指标上取得显著提升,并在未见数据集上展现出卓越的泛化能力。
English: EditGRPO is a mixed-policy reinforcement learning algorithm that enhances radiology report generation by optimizing clinical efficacy through sentence-level corrections, achieving significant improvements in performance metrics and superior generalization on unseen datasets.

Authors:Karim Khamaisi, Oliver Kamer, Bruno Rodrigues, Jan von der Assen, Burkhard Stiller
Title: Bridging Technical Capability and User Accessibility: Off-grid Civilian Emergency Communication
Abstract:
During large-scale crises disrupting cellular and Internet infrastructure, civilians lack reliable methods for communication, aid coordination, and access to trustworthy information. This paper presents a unified emergency communication system integrating a low-power, long-range network with a crisis-oriented smartphone application, enabling decentralized and off-grid civilian communication. Unlike previous solutions separating physical layer resilience from user layer usability, our design merges these aspects into a cohesive crisis-tailored framework. The system is evaluated in two dimensions: communication performance and application functionality. Field experiments in urban Zürich demonstrate that the 868 MHz band, using the LongFast configuration, achieves a communication range of up to 1.2 km with 92% Packet Delivery Ratio, validating network robustness under real-world infrastructure degraded conditions. In parallel, a purpose-built mobile application featuring peer-to-peer messaging, identity verification, and community moderation was evaluated through a requirements-based analysis.
中文: 本文提出了一种集成了低功耗远距离网络与危机专用智能手机应用的统一应急通信系统,能够在基础设施中断时实现去中心化的民用通信,苏黎世的实地测试验证了其通信性能与应用功能的可靠性。
English: This paper introduces an integrated emergency communication system that combines a low-power, long-range network with a crisis-specific smartphone app, enabling decentralized civilian communication and information sharing during infrastructure failures, with field tests in Zurich confirming its robust performance and functionality.

Authors:Xiaocheng Zou, Shijin Duan, Charles Fleming, Gaowen Liu, Ramana Rao Kompella, Shaolei Ren, Xiaolin Xu
Title: ConQuER: Modular Architectures for Control and Bias Mitigation in IQP Quantum Generative Models
Abstract:
Quantum generative models based on instantaneous quantum polynomial (IQP) circuits show great promise in learning complex distributions while maintaining classical trainability. However, current implementations suffer from two key limitations: lack of controllability over generated outputs and severe generation bias towards certain expected patterns. We present a Controllable Quantum Generative Framework, ConQuER, which addresses both challenges through a modular circuit architecture. ConQuER embeds a lightweight controller circuit that can be directly combined with pre-trained IQP circuits to precisely control the output distribution without full retraining. Leveraging the advantages of IQP, our scheme enables precise control over properties such as the Hamming Weight distribution with minimal parameter and gate overhead. In addition, inspired by the controller design, we extend this modular approach through data-driven optimization to embed implicit control paths in the underlying IQP architecture, significantly reducing generation bias on structured datasets. ConQuER retains efficient classical training properties and high scalability. We experimentally validate ConQuER on multiple quantum state datasets, demonstrating its superior control accuracy and balanced generation performance, only with very low overhead cost over original IQP circuits. Our framework bridges the gap between the advantages of quantum computing and the practical needs of controllable generation modeling.
中文: ConQuER提出了一种可控量子生成框架,通过模块化设计解决了IQP电路在输出可控性和生成偏差方面的局限,能以极低开销实现精确输出控制并保持高效经典训练与可扩展性。
English: ConQuER introduces a controllable quantum generative framework that overcomes the limitations of IQP circuits by enabling precise output control and reducing generation bias through a modular design, maintaining efficient classical training and scalability with minimal overhead.

Authors:Hua Yuan, Xuran Meng, Qiufeng Wang, Shiyu Xia, Ning Xu, Xu Yang, Jing Wang, Xin Geng, Yong Rui
Title: Towards Understanding Feature Learning in Parameter Transfer
Abstract:
Parameter transfer is a central paradigm in transfer learning, enabling knowledge reuse across tasks and domains by sharing model parameters between upstream and downstream models. However, when only a subset of parameters from the upstream model is transferred to the downstream model, there remains a lack of theoretical understanding of the conditions under which such partial parameter reuse is beneficial and of the factors that govern its effectiveness. To address this gap, we analyze a setting in which both the upstream and downstream models are ReLU convolutional neural networks (CNNs). Within this theoretical framework, we characterize how the inherited parameters act as carriers of universal knowledge and identify key factors that amplify their beneficial impact on the target task. Furthermore, our analysis provides insight into why, in certain cases, transferring parameters can lead to lower test accuracy on the target task than training a new model from scratch. Numerical experiments and real-world data experiments are conducted to empirically validate our theoretical findings.
中文: 本研究对ReLU卷积神经网络中的部分参数迁移进行了理论分析,揭示了增强知识重用的关键因素,并解释了参数迁移在某些情况下可能不如从头训练的原因,实验数据验证了理论结果。
English: This study provides a theoretical analysis of partial parameter transfer in ReLU CNNs, identifying key factors that enhance knowledge reuse and explaining scenarios where transfer may underperform training from scratch, with empirical validation supporting the findings.

Authors:Bochuan Cao, Changjiang Li, Yuanpu Cao, Yameng Ge, Ting Wang, Jinghui Chen
Title: You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors
Abstract:
Large language models (LLMs) have been widely adopted across various applications, leveraging customized system prompts for diverse tasks. Facing potential system prompt leakage risks, model developers have implemented strategies to prevent leakage, primarily by disabling LLMs from repeating their context when encountering known attack patterns. However, it remains vulnerable to new and unforeseen prompt-leaking techniques. In this paper, we first introduce a simple yet effective prompt leaking attack to reveal such risks. Our attack is capable of extracting system prompts from various LLM-based application, even from SOTA LLM models such as GPT-4o or Claude 3.5 Sonnet. Our findings further inspire us to search for a fundamental solution to the problems by having no system prompt in the context. To this end, we propose SysVec, a novel method that encodes system prompts as internal representation vectors rather than raw text. By doing so, SysVec minimizes the risk of unauthorized disclosure while preserving the LLM's core language capabilities. Remarkably, this approach not only enhances security but also improves the model's general instruction-following abilities. Experimental results demonstrate that SysVec effectively mitigates prompt leakage attacks, preserves the LLM's functional integrity, and helps alleviate the forgetting issue in long-context scenarios.
中文: 本文提出一种简单有效的提示词泄露攻击,能够从GPT-4o和Claude 3.5 Sonnet等先进大模型中提取系统提示,同时创新性地开发了SysVec方法,通过将系统提示编码为内部向量来增强安全性并保持模型性能。
English: This paper introduces a simple yet effective prompt leakage attack that can extract system prompts from advanced LLMs like GPT-4o and Claude 3.5 Sonnet, and proposes SysVec, a novel method that encodes system prompts as internal vectors to enhance security while maintaining model performance.

Authors:Yixin Wan, Xingrun Chen, Kai-Wei Chang
Title: Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Abstract:
Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM's default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent(initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
中文摘要:大型语言模型存在文化定位偏差,倾向于美国主流文化视角而将其他文化视为外来者,但通过基于智能体的缓解框架,利用专门代理进行批判和优化,可有效减少这种生成偏见。
English Summary: Large language models exhibit cultural positioning bias by favoring mainstream US perspectives and treating other cultures as outsiders, but this can be mitigated through agent-based frameworks that employ specialized agents to critique and refine generated content for fairness.

Authors:Taehee Park, Heejin Do, Gary Geunbae Lee
Title: Leveraging What's Overfixed: Post-Correction via LLM Grammatical Error Overcorrection
Abstract:
Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect. They achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, making excessive overcorrection, leading to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.
中文:PoCO是一种创新方法,先利用大语言模型故意过度修正以提高召回率,再通过小模型精修输出以保持精确度,从而有效平衡语法纠错的整体性能。
English: PoCO is a novel method that leverages large language models to intentionally overcorrect for high recall, then refines outputs with small models to maintain precision, effectively balancing grammatical error correction performance.

Authors:Kai Zhang, Corey D Barrett, Jangwon Kim, Lichao Sun, Tara Taghavi, Krishnaram Kenthapadi
Title: RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows
Abstract:
Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. To bridge the above gaps, we present RadAgents, a multi-agent framework for CXR interpretation that couples clinical priors with task-aware multimodal reasoning. In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice.
中文: 智能体系统虽能通过专业协作处理复杂临床任务,但现有胸片解读方法缺乏可解释推理、多模态融合及冲突解决机制,因此提出RadAgents框架,结合临床先验与验证机制确保输出可靠性。
English: Agentic systems can address complex clinical tasks via specialized collaboration but current chest X-ray methods lack interpretable reasoning, multimodal fusion, and conflict resolution, leading to the proposed RadAgents framework that integrates clinical priors with verification mechanisms for reliable outputs.

Authors:Kota Dohi, Tomoya Nishida, Harsh Purohit, Takashi Endo, Yohei Kawaguchi
Title: DiffNator: Generating Structured Explanations of Time-Series Differences
Abstract:
In many IoT applications, the central interest lies not in individual sensor signals but in their differences, yet interpreting such differences requires expert knowledge. We propose DiffNator, a framework for structured explanations of differences between two time series. We first design a JSON schema that captures the essential properties of such differences. Using the Time-series Observations of Real-world IoT (TORI) dataset, we generate paired sequences and train a model that combine a time-series encoder with a frozen LLM to output JSON-formatted explanations. Experimental results show that DiffNator generates accurate difference explanations and substantially outperforms both a visual question answering (VQA) baseline and a retrieval method using a pre-trained time-series encoder.
中文:DiffNator框架通过设计JSON模式并训练结合时间序列编码器与冻结大语言模型的系统,生成两个时间序列差异的结构化解释,实验证明其准确性显著优于视觉问答基线和预训练检索方法。
English: DiffNator is a framework that uses a JSON schema and a trained model combining a time-series encoder with a frozen LLM to generate structured explanations of differences between two time series, demonstrating superior accuracy over baseline methods in experiments.

Authors:Rushuai Yang, Hangxing Wei, Ran Zhang, Zhiyuan Feng, Xiaoyu Chen, Tong Li, Chuheng Zhang, Li Zhao, Jiang Bian, Xiu Su, Yi Chen
Title: Beyond Human Demonstrations: Diffusion-Based Reinforcement Learning to Generate Data for VLA Training
Abstract:
Vision-language-action (VLA) models have shown strong generalization across tasks and embodiments; however, their reliance on large-scale human demonstrations limits their scalability owing to the cost and effort of manual data collection. Reinforcement learning (RL) offers a potential alternative to generate demonstrations autonomously, yet conventional RL algorithms often struggle on long-horizon manipulation tasks with sparse rewards. In this paper, we propose a modified diffusion policy optimization algorithm to generate high-quality and low-variance trajectories, which contributes to a diffusion RL-powered VLA training pipeline. Our algorithm benefits from not only the high expressiveness of diffusion models to explore complex and diverse behaviors but also the implicit regularization of the iterative denoising process to yield smooth and consistent demonstrations. We evaluate our approach on the LIBERO benchmark, which includes 130 long-horizon manipulation tasks, and show that the generated trajectories are smoother and more consistent than both human demonstrations and those from standard Gaussian RL policies. Further, training a VLA model exclusively on the diffusion RL-generated data achieves an average success rate of 81.9%, which outperforms the model trained on human data by +5.3% and that on Gaussian RL-generated data by +12.6%. The results highlight our diffusion RL as an effective alternative for generating abundant, high-quality, and low-variance demonstrations for VLA models.
中文: 本文提出一种改进的扩散策略优化算法,能为视觉-语言-动作模型生成平滑连贯的轨迹,在长程操作任务上表现优于人类示范数据和传统强化学习方法。
English: This paper introduces a modified diffusion policy optimization algorithm that generates smooth, consistent trajectories for vision-language-action models, achieving superior performance on long-horizon manipulation tasks compared to human demonstrations and conventional reinforcement learning methods.

Authors:Rushuai Yang, Hangxing Wei, Ran Zhang, Zhiyuan Feng, Xiaoyu Chen, Tong Li, Chuheng Zhang, Li Zhao, Jiang Bian, Xiu Su, Yi Chen
Title: Beyond Human Demonstrations: Diffusion-Based Reinforcement Learning to Generate Data for VLA Training
Abstract:
Vision-language-action (VLA) models have shown strong generalization across tasks and embodiments; however, their reliance on large-scale human demonstrations limits their scalability owing to the cost and effort of manual data collection. Reinforcement learning (RL) offers a potential alternative to generate demonstrations autonomously, yet conventional RL algorithms often struggle on long-horizon manipulation tasks with sparse rewards. In this paper, we propose a modified diffusion policy optimization algorithm to generate high-quality and low-variance trajectories, which contributes to a diffusion RL-powered VLA training pipeline. Our algorithm benefits from not only the high expressiveness of diffusion models to explore complex and diverse behaviors but also the implicit regularization of the iterative denoising process to yield smooth and consistent demonstrations. We evaluate our approach on the LIBERO benchmark, which includes 130 long-horizon manipulation tasks, and show that the generated trajectories are smoother and more consistent than both human demonstrations and those from standard Gaussian RL policies. Further, training a VLA model exclusively on the diffusion RL-generated data achieves an average success rate of 81.9%, which outperforms the model trained on human data by +5.3% and that on Gaussian RL-generated data by +12.6%. The results highlight our diffusion RL as an effective alternative for generating abundant, high-quality, and low-variance demonstrations for VLA models.
中文: 本文提出一种改进的扩散策略优化算法,能为视觉-语言-动作模型生成平滑连贯的轨迹,在长程操作任务上表现优于人类示范数据和传统强化学习方法。
English: This paper introduces a modified diffusion policy optimization algorithm that generates smooth, consistent trajectories for vision-language-action models, achieving superior performance on long-horizon manipulation tasks compared to human demonstrations and conventional reinforcement learning methods.

Authors:Yanfang Fanny Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Patricia Culligan, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Tom Stapleford, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh Chawla
Title: LLMs4All: A Review on Large Language Models for Research and Applications in Academic Disciplines
Abstract:
Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.
中文: 本文综述了前沿大语言模型及其在多学科领域的整合应用,探讨了它们对研究和实践的变革性影响,同时分析了生成式人工智能时代的关键挑战与未来方向。
English: This paper provides a comprehensive overview of state-of-the-art Large Language Models (LLMs) and their integration across diverse academic disciplines, exploring their transformative potential in research and practice while addressing limitations and future directions in the generative AI era.

Authors:Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine
Title: OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation
Abstract:
Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models. We present videos showcasing OmniVLA performance and will release its checkpoints and training code on our project page.
中文: 本文提出的OmniVLA机器人基础模型通过随机模态融合策略,整合二维位姿、图像和语言等多种目标模态进行视觉导航训练,实现了优越的泛化能力,其性能超越各专业基线模型。
English: This paper introduces OmniVLA, a robotic foundation model that enables flexible vision-based navigation by training with multiple goal modalities—including 2D poses, images, and language—through a randomized fusion strategy, achieving strong generalization and outperforming specialist baselines.

Authors:Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, Xuanchi Ren
Title: Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
Abstract:
The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
中文: 本文提出一种自蒸馏框架,将视频扩散模型中的隐式3D知识提炼为显式的3D高斯溅射表示,无需多视角训练数据即可实现最先进的静态与动态3D场景生成。
English: This paper introduces a self-distillation framework that extracts implicit 3D knowledge from video diffusion models to create explicit 3D Gaussian Splatting representations, enabling state-of-the-art static and dynamic 3D scene generation without requiring multi-view training data.

Authors:Zheyuan Liu, Zhangchen Xu, Guangyao Dou, Xiangchi Yuan, Zhaoxuan Tan, Radha Poovendran, Meng Jiang
Title: Steering Multimodal Large Language Models Decoding for Context-Aware Safety
Abstract:
Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.
Chinese: SafeCoDe框架通过对比解码和标记调节,提升多模态大语言模型的安全性,有效减少不合理拒绝并增强风险识别能力,同时保持模型的实用性。
English: The SafeCoDe framework enhances multimodal large language models' safety by using contrastive decoding and token modulation to reduce unjustified refusals and improve risk detection without compromising helpfulness.

Authors:Shuning Zhang, Hong Jia, Simin Li, Ting Dang, Yongquan `Owen' Hu, Xin Yi, Hewu Li
Title: Position: Human-Robot Interaction in Embodied Intelligence Demands a Shift From Static Privacy Controls to Dynamic Learning
Abstract:
The reasoning capabilities of embodied agents introduce a critical, under-explored inferential privacy challenge, where the risk of an agent generate sensitive conclusions from ambient data. This capability creates a fundamental tension between an agent's utility and user privacy, rendering traditional static controls ineffective. To address this, this position paper proposes a framework that reframes privacy as a dynamic learning problem grounded in theory of Contextual Integrity (CI). Our approach enables agents to proactively learn and adapt to individual privacy norms through interaction, outlining a research agenda to develop embodied agents that are both capable and function as trustworthy safeguards of user privacy.
中文: 该立场文件针对具身智能体的推理隐私风险,提出了基于情境完整性的动态学习框架,使智能体能够通过交互适应个体隐私规范,在保障效用与隐私之间实现平衡。
English: This position paper addresses the inferential privacy risks of embodied agents by proposing a dynamic learning framework based on Contextual Integrity, enabling agents to adapt to individual privacy norms through interaction while balancing utility and privacy.

Authors:Hanqing Liu, Jiahuan Long, Junqi Wu, Jiacheng Hou, Huili Tang, Tingsong Jiang, Weien Zhou, Wen Yao
Title: Eva-VLA: Evaluating Vision-Language-Action Models' Robustness Under Real-World Physical Variations
Abstract:
Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework that systematically evaluates the robustness of VLA models by transforming discrete physical variations into continuous optimization problems. However, comprehensively assessing VLA robustness presents two key challenges: (1) how to systematically characterize diverse physical variations encountered in real-world deployments while maintaining evaluation reproducibility, and (2) how to discover worst-case scenarios without prohibitive real-world data collection costs efficiently. To address the first challenge, we decompose real-world variations into three critical domains: object 3D transformations that affect spatial reasoning, illumination variations that challenge visual perception, and adversarial patches that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization framework that transforms discrete physical variations into parameter optimization, enabling systematic exploration of worst-case scenarios. Extensive experiments on state-of-the-art OpenVLA models across multiple benchmarks reveal alarming vulnerabilities: all variation types trigger failure rates exceeding 60%, with object transformations causing up to 97.8% failure in long-horizon tasks. Our findings expose critical gaps between controlled laboratory success and unpredictable deployment readiness, while the Eva-VLA framework provides a practical pathway for hardening VLA-based robotic manipulation models against real-world deployment challenges.
中文: Eva-VLA是首个通过将离散物理变化转化为连续优化问题来系统评估视觉-语言-动作模型鲁棒性的统一框架,实验发现在各类现实场景中模型故障率均超过60%,揭示了从实验室到实际部署存在的严重脆弱性。
English: Eva-VLA is the first unified framework to systematically evaluate the robustness of Vision-Language-Action models by transforming discrete physical variations into continuous optimization problems, revealing alarming vulnerabilities with failure rates exceeding 60% across various real-world scenarios.

Authors:Yingxin Li, Jianbo Zhao, Xueyu Ren, Jie Tang, Wangjie You, Xu Chen, Kan Zhou, Chao Feng, Jiao Ran, Yuan Meng, Zhi Wang
Title: Conf-Profile: A Confidence-Driven Reasoning Paradigm for Label-Free User Profiling
Abstract:
User profiling, as a core technique for user understanding, aims to infer structural attributes from user information. Large Language Models (LLMs) provide a promising avenue for user profiling, yet the progress is hindered by the lack of comprehensive benchmarks. To bridge this gap, we propose ProfileBench, an industrial benchmark derived from a real-world video platform, encompassing heterogeneous user data and a well-structured profiling taxonomy. However, the profiling task remains challenging due to the difficulty of collecting large-scale ground-truth labels, and the heterogeneous and noisy user information can compromise the reliability of LLMs. To approach label-free and reliable user profiling, we propose a Confidence-driven Profile reasoning framework Conf-Profile, featuring a two-stage paradigm. We first synthesize high-quality labels by leveraging advanced LLMs with confidence hints, followed by confidence-weighted voting for accuracy improvement and confidence calibration for a balanced distribution. The multiple profile results, rationales, and confidence scores are aggregated and distilled into a lightweight LLM. We further enhance the reasoning ability via confidence-guided unsupervised reinforcement learning, which exploits confidence for difficulty filtering, quasi-ground truth voting, and reward weighting. Experimental results demonstrate that Conf-Profile delivers substantial performance through the two-stage training, improving F1 by 13.97 on Qwen3-8B.
中文摘要:为解决大语言模型用户画像缺乏基准的问题,ProfileBench作为工业级基准被提出,而Conf-Profile框架通过置信度驱动的两阶段训练——合成高质量标签与无监督强化学习,实现了无需标注的可靠用户画像,性能显著提升。
English Summary: ProfileBench is introduced as a comprehensive industrial benchmark to address the lack of evaluation standards for user profiling with LLMs, while Conf-Profile framework leverages confidence-driven synthesis and reinforcement learning to achieve label-free and reliable profiling with significant performance improvements.

Authors:Jay Patrikar, Apoorva Sharma, Sushant Veer, Boyi Li, Sebastian Scherer, Marco Pavone
Title: The Case for Negative Data: From Crash Reports to Counterfactuals for Reasonable Driving
Abstract:
Learning-based autonomous driving systems are trained mostly on incident-free data, offering little guidance near safety-performance boundaries. Real crash reports contain precisely the contrastive evidence needed, but they are hard to use: narratives are unstructured, third-person, and poorly grounded to sensor views. We address these challenges by normalizing crash narratives to ego-centric language and converting both logs and crashes into a unified scene-action representation suitable for retrieval. At decision time, our system adjudicates proposed actions by retrieving relevant precedents from this unified index; an agentic counterfactual extension proposes plausible alternatives, retrieves for each, and reasons across outcomes before deciding. On a nuScenes benchmark, precedent retrieval substantially improves calibration, with recall on contextually preferred actions rising from 24% to 53%. The counterfactual variant preserves these gains while sharpening decisions near risk.
中文摘要:该系统通过将事故报告和传感器数据转化为统一表征,实现先例检索和反事实推理,从而在风险边界附近提升自动驾驶决策的安全性和精确性。
English Summary: The system improves autonomous driving safety by converting crash reports and sensor data into a unified representation, enabling precedent retrieval and counterfactual reasoning to enhance decision-making near risk boundaries.

Authors:Hanqun Cao, Marcelo D. T. Torres, Jingjie Zhang, Zijun Gao, Fang Wu, Chunbin Gu, Jure Leskovec, Yejin Choi, Cesar de la Fuente-Nunez, Guangyong Chen, Pheng-Ann Heng
Title: A deep reinforcement learning platform for antibiotic discovery
Abstract:
Antimicrobial resistance (AMR) is projected to cause up to 10 million deaths annually by 2050, underscoring the urgent need for new antibiotics. Here we present ApexAmphion, a deep-learning framework for de novo design of antibiotics that couples a 6.4-billion-parameter protein language model with reinforcement learning. The model is first fine-tuned on curated peptide data to capture antimicrobial sequence regularities, then optimised with proximal policy optimization against a composite reward that combines predictions from a learned minimum inhibitory concentration (MIC) classifier with differentiable physicochemical objectives. In vitro evaluation of 100 designed peptides showed low MIC values (nanomolar range in some cases) for all candidates (100% hit rate). Moreover, 99 our of 100 compounds exhibited broad-spectrum antimicrobial activity against at least two clinically relevant bacteria. The lead molecules killed bacteria primarily by potently targeting the cytoplasmic membrane. By unifying generation, scoring and multi-objective optimization with deep reinforcement learning in a single pipeline, our approach rapidly produces diverse, potent candidates, offering a scalable route to peptide antibiotics and a platform for iterative steering toward potency and developability within hours.
中文:ApexAmphion深度学习框架融合蛋白质语言模型与强化学习,成功设计出新型抗生素,实验室测试显示所有候选肽均具抗菌活性且能靶向细胞膜,实现100%命中率。
English: The ApexAmphion deep-learning framework combines a protein language model with reinforcement learning to design novel antibiotics, achieving 100% hit rate with potent broad-spectrum activity through membrane targeting in laboratory tests.

Authors:Nachiket N. Naik, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Title: BULL-ODE: Bullwhip Learning with Neural ODEs and Universal Differential Equations under Stochastic Demand
Abstract:
We study learning of continuous-time inventory dynamics under stochastic demand and quantify when structure helps or hurts forecasting of the bullwhip effect. BULL-ODE compares a fully learned Neural ODE (NODE) that models the entire right-hand side against a physics-informed Universal Differential Equation (UDE) that preserves conservation and order-up-to structure while learning a small residual policy term. Classical supply chain models explain the bullwhip through control/forecasting choices and information sharing, while recent physics-informed and neural differential equation methods blend domain constraints with learned components. It is unclear whether structural bias helps or hinders forecasting under different demand regimes. We address this by using a single-echelon testbed with three demand regimes - AR(1) (autocorrelated), i.i.d. Gaussian, and heavy-tailed lognormal. Training is done on varying fractions of each trajectory, followed by evaluation of multi-step forecasts for inventory I, order rate O, and demand D. Across the structured regimes, UDE consistently generalizes better: with 90% of the training horizon, inventory RMSE drops from 4.92 (NODE) to 0.26 (UDE) under AR(1) and from 5.96 to 0.95 under Gaussian demand. Under heavy-tailed lognormal shocks, the flexibility of NODE is better. These trends persist as train18 ing data shrinks, with NODE exhibiting phase drift in extrapolation while UDE remains stable but underreacts to rare spikes. Our results provide concrete guidance: enforce structure when noise is light-tailed or temporally correlated; relax structure when extreme events dominate. Beyond inventory control, the results offer guidance for hybrid modeling in scientific and engineering systems: enforce known structure when conservation laws and modest noise dominate, and relax structure to capture extremes in settings where rare events drive dynamics.
中文: 本研究评估了结构建模对连续时间库存系统预测精度的影响,发现在结构化需求模式下物理信息通用微分方程优于完全学习的神经微分方程,但在处理由极端事件主导的重尾分布时表现较差。
English: This study evaluates how structural modeling affects forecasting accuracy in continuous-time inventory systems, finding that physics-informed Universal Differential Equations outperform fully learned Neural ODEs under structured demand regimes but underperform when handling heavy-tailed distributions dominated by extreme events.

Authors:Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
Title: Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
Abstract:
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.
中文: Spiffy是一种推测解码算法,通过自推测草稿生成和优化的草稿图,在保持输出质量的同时将扩散大模型的推理速度提升2.8-3.1倍,实现显著加速效果。
English: Spiffy is a speculative decoding algorithm that accelerates diffusion LLMs by 2.8-3.1× while preserving output quality, using auto-speculative draft generation and optimized draft graphs to achieve significant speedups.

Authors:Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
Title: Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
Abstract:
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM's distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.
中文: Spiffy是一种推测解码算法,通过自推测草稿生成和优化的草稿图,在保持输出质量的同时将扩散大模型的推理速度提升2.8-3.1倍,实现显著加速效果。
English: Spiffy is a speculative decoding algorithm that accelerates diffusion LLMs by 2.8-3.1× while preserving output quality, using auto-speculative draft generation and optimized draft graphs to achieve significant speedups.

Authors:Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang
Title: 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
Abstract:
Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and multiple bitrate within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro
中文摘要:4DGCPro提出了一种分层4D高斯压缩框架,通过运动感知分组和熵优化训练实现实时移动端解码与渐进式流传输,在率失真性能上超越现有方法。
English Summary: 4DGCPro introduces a hierarchical 4D Gaussian compression framework enabling real-time mobile decoding and progressive streaming through motion-aware grouping and entropy-optimized training, outperforming existing methods in rate-distortion performance.

Authors:Zihan Zheng, Zhenlong Wu, Houqiang Zhong, Yuan Tian, Ning Cao, Lan Xu, Jiangchao Yao, Xiaoyun Zhang, Qiang Hu, Wenjun Zhang
Title: 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming
Abstract:
Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and multiple bitrate within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets. Project Page: https://mediax-sjtu.github.io/4DGCPro
中文摘要:4DGCPro提出了一种分层4D高斯压缩框架,通过运动感知分组和熵优化训练实现实时移动端解码与渐进式流传输,在率失真性能上超越现有方法。
English Summary: 4DGCPro introduces a hierarchical 4D Gaussian compression framework enabling real-time mobile decoding and progressive streaming through motion-aware grouping and entropy-optimized training, outperforming existing methods in rate-distortion performance.

Authors:Ying Feng, Hongjie Fang, Yinong He, Jingjing Chen, Chenxi Wang, Zihao He, Ruonan Liu, Cewu Lu
Title: Learning Dexterous Manipulation with Quantized Hand State
Abstract:
Dexterous robotic hands enable robots to perform complex manipulations that require fine-grained control and adaptability. Achieving such manipulation is challenging because the high degrees of freedom tightly couple hand and arm motions, making learning and control difficult. Successful dexterous manipulation relies not only on precise hand motions, but also on accurate spatial positioning of the arm and coordinated arm-hand dynamics. However, most existing visuomotor policies represent arm and hand actions in a single combined space, which often causes high-dimensional hand actions to dominate the coupled action space and compromise arm control. To address this, we propose DQ-RISE, which quantizes hand states to simplify hand motion prediction while preserving essential patterns, and applies a continuous relaxation that allows arm actions to diffuse jointly with these compact hand states. This design enables the policy to learn arm-hand coordination from data while preventing hand actions from overwhelming the action space. Experiments show that DQ-RISE achieves more balanced and efficient learning, paving the way toward structured and generalizable dexterous manipulation. Project website: http://rise-policy.github.io/DQ-RISE/
中文摘要:DQ-RISE通过量化手部状态并应用连续松弛技术,实现了手臂与手部的协调控制,有效防止高维手部动作主导控制空间,从而提升了灵巧操作的平衡学习效率。
English Summary: DQ-RISE introduces a novel approach that quantizes hand states and applies continuous relaxation to enable balanced arm-hand coordination, achieving more efficient learning for dexterous manipulation without letting high-dimensional hand actions dominate the control space.

Authors:Jingjing Chen, Hongjie Fang, Chenxi Wang, Shiquan Wang, Cewu Lu
Title: History-Aware Visuomotor Policy Learning via Point Tracking
Abstract:
Many manipulation tasks require memory beyond the current observation, yet most visuomotor policies rely on the Markov assumption and thus struggle with repeated states or long-horizon dependencies. Existing methods attempt to extend observation horizons but remain insufficient for diverse memory requirements. To this end, we propose an object-centric history representation based on point tracking, which abstracts past observations into a compact and structured form that retains only essential task-relevant information. Tracked points are encoded and aggregated at the object level, yielding a compact history representation that can be seamlessly integrated into various visuomotor policies. Our design provides full history-awareness with high computational efficiency, leading to improved overall task performance and decision accuracy. Through extensive evaluations on diverse manipulation tasks, we show that our method addresses multiple facets of memory requirements - such as task stage identification, spatial memorization, and action counting, as well as longer-term demands like continuous and pre-loaded memory - and consistently outperforms both Markovian baselines and prior history-based approaches. Project website: http://tonyfang.net/history
中文: 本文提出一种基于点跟踪的物体中心历史表征方法,将过去观察抽象为紧凑结构化形式,显著提升了视觉运动策略的记忆处理能力和任务执行效果。
English: This paper introduces an object-centric history representation using point tracking to compactly structure past observations, enhancing visuomotor policies with efficient memory handling and superior performance across diverse manipulation tasks.

Authors:Pengfei Hao, Hongqiu Wang, Shuaibo Li, Zhaohu Xing, Guang Yang, Kaishun Wu, Lei Zhu
Title: Surgical-MambaLLM: Mamba2-enhanced Multimodal Large Language Model for VQLA in Robotic Surgery
Abstract:
In recent years, Visual Question Localized-Answering in robotic surgery (Surgical-VQLA) has gained significant attention for its potential to assist medical students and junior doctors in understanding surgical scenes. Recently, the rapid development of Large Language Models (LLMs) has provided more promising solutions for this task. However, current methods struggle to establish complex dependencies between text and visual details, and have difficulty perceiving the spatial information of surgical scenes. To address these challenges, we propose a novel method, Surgical-MambaLLM, which is the first to combine Mamba2 with LLM in the surgical domain, that leverages Mamba2's ability to effectively capture cross-modal dependencies and perceive spatial information in surgical scenes, thereby enhancing the LLMs' understanding of surgical images. Specifically, we propose the Cross-modal Bidirectional Mamba2 Integration (CBMI) module to leverage Mamba2 for effective multimodal fusion, with its cross-modal integration capabilities. Additionally, tailored to the geometric characteristics of surgical scenes, we design the Surgical Instrument Perception (SIP) scanning mode for Mamba2 to scan the surgical images, enhancing the model's spatial understanding of the surgical scene. Extensive experiments demonstrate that our Surgical-MambaLLM model outperforms the state-of-the-art methods on the EndoVis17-VQLA and EndoVis18-VQLA datasets, significantly improving the performance of the Surgical-VQLA task.
Chinese: 提出的Surgical-MambaLLM模型将Mamba2与大型语言模型结合,通过改进多模态融合和手术场景空间感知能力,在Surgical-VQLA任务中实现了最优性能。
English: The proposed Surgical-MambaLLM model integrates Mamba2 with LLMs to enhance multimodal fusion and spatial perception in surgical scenes, achieving state-of-the-art performance on Surgical-VQLA benchmarks.

Authors:Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Saee Paliwal, Weili Nie, Arash Vahdat
Title: Rethinking Molecule Synthesizability with Chain-of-Reaction
Abstract:
A well-known pitfall of molecular generative models is that they are not guaranteed to generate synthesizable molecules. There have been considerable attempts to address this problem, but given the exponentially large combinatorial space of synthesizable molecules, existing methods have shown limited coverage of the space and poor molecular optimization performance. To tackle these problems, we introduce ReaSyn, a generative framework for synthesizable projection where the model explores the neighborhood of given molecules in the synthesizable space by generating pathways that result in synthesizable analogs. To fully utilize the chemical knowledge contained in the synthetic pathways, we propose a novel perspective that views synthetic pathways akin to reasoning paths in large language models (LLMs). Specifically, inspired by chain-of-thought (CoT) reasoning in LLMs, we introduce the chain-of-reaction (CoR) notation that explicitly states reactants, reaction types, and intermediate products for each step in a pathway. With the CoR notation, ReaSyn can get dense supervision in every reaction step to explicitly learn chemical reaction rules during supervised training and perform step-by-step reasoning. In addition, to further enhance the reasoning capability of ReaSyn, we propose reinforcement learning (RL)-based finetuning and goal-directed test-time compute scaling tailored for synthesizable projection. ReaSyn achieves the highest reconstruction rate and pathway diversity in synthesizable molecule reconstruction and the highest optimization performance in synthesizable goal-directed molecular optimization, and significantly outperforms previous synthesizable projection methods in synthesizable hit expansion. These results highlight ReaSyn's superior ability to navigate combinatorially-large synthesizable chemical space.
中文: ReaSyn作为一种生成框架,通过将合成路径类比为推理链来提升可合成分子的生成能力,结合监督训练和强化学习,在分子重构与优化中实现了卓越性能。
English: ReaSyn is a generative framework that enhances synthesizable molecule generation by modeling synthetic pathways as reasoning chains, achieving superior performance in molecular reconstruction and optimization through supervised training and reinforcement learning.

Authors:Shalini Dangi, Surya Karthikeya Mullapudi, Chandravardhan Singh Raghaw, Shahid Shafi Dar, Mohammad Zia Ur Rehman, Nagendra Kumar
Title: A multi-temporal multi-spectral attention-augmented deep convolution neural network with contrastive learning for crop yield prediction
Abstract:
Precise yield prediction is essential for agricultural sustainability and food security. However, climate change complicates accurate yield prediction by affecting major factors such as weather conditions, soil fertility, and farm management systems. Advances in technology have played an essential role in overcoming these challenges by leveraging satellite monitoring and data analysis for precise yield estimation. Current methods rely on spatio-temporal data for predicting crop yield, but they often struggle with multi-spectral data, which is crucial for evaluating crop health and growth patterns. To resolve this challenge, we propose a novel Multi-Temporal Multi-Spectral Yield Prediction Network, MTMS-YieldNet, that integrates spectral data with spatio-temporal information to effectively capture the correlations and dependencies between them. While existing methods that rely on pre-trained models trained on general visual data, MTMS-YieldNet utilizes contrastive learning for feature discrimination during pre-training, focusing on capturing spatial-spectral patterns and spatio-temporal dependencies from remote sensing data. Both quantitative and qualitative assessments highlight the excellence of the proposed MTMS-YieldNet over seven existing state-of-the-art methods. MTMS-YieldNet achieves MAPE scores of 0.336 on Sentinel-1, 0.353 on Landsat-8, and an outstanding 0.331 on Sentinel-2, demonstrating effective yield prediction performance across diverse climatic and seasonal conditions. The outstanding performance of MTMS-YieldNet improves yield predictions and provides valuable insights that can assist farmers in making better decisions, potentially improving crop yields.
中文: 提出的MTMS-YieldNet模型通过对比学习整合多光谱和时空数据,在多种气候条件下实现了卓越的作物产量预测精度,以最低0.331的MAPE评分超越现有方法。
English: The proposed MTMS-YieldNet model integrates multi-spectral and spatio-temporal data using contrastive learning to achieve superior crop yield prediction accuracy across various climatic conditions, outperforming existing methods with MAPE scores as low as 0.331.

Authors:Chen Wang, Zeyuan Ma, Zhiguang Cao, Yue-Jiao Gong
Title: Instance Generation for Meta-Black-Box Optimization through Latent Space Reverse Engineering
Abstract:
To relieve intensive human-expertise required to design optimization algorithms, recent Meta-Black-Box Optimization (MetaBBO) researches leverage generalization strength of meta-learning to train neural network-based algorithm design policies over a predefined training problem set, which automates the adaptability of the low-level optimizers on unseen problem instances. Currently, a common training problem set choice in existing MetaBBOs is well-known benchmark suites CoCo-BBOB. Although such choice facilitates the MetaBBO's development, problem instances in CoCo-BBOB are more or less limited in diversity, raising the risk of overfitting of MetaBBOs, which might further results in poor generalization. In this paper, we propose an instance generation approach, termed as \textbf{LSRE}, which could generate diverse training problem instances for MetaBBOs to learn more generalizable policies. LSRE first trains an autoencoder which maps high-dimensional problem features into a 2-dimensional latent space. Uniform-grid sampling in this latent space leads to hidden representations of problem instances with sufficient diversity. By leveraging a genetic-programming approach to search function formulas with minimal L2-distance to these hidden representations, LSRE reverse engineers a diversified problem set, termed as \textbf{Diverse-BBO}. We validate the effectiveness of LSRE by training various MetaBBOs on Diverse-BBO and observe their generalization performances on either synthetic or realistic scenarios. Extensive experimental results underscore the superiority of Diverse-BBO to existing training set choices in MetaBBOs. Further ablation studies not only demonstrate the effectiveness of design choices in LSRE, but also reveal interesting insights on instance diversity and MetaBBO's generalization.
中文摘要:本文提出LSRE方法,通过将问题特征映射到潜在空间并利用遗传编程反向生成多样化问题集,为元黑盒优化提供多样化的训练实例,有效防止过拟合并提升算法泛化能力。
English Summary: The paper introduces LSRE, an instance generation method that creates diverse training problems for Meta-Black-Box Optimization to prevent overfitting and improve generalization by mapping features into a latent space and reverse-engineering varied problem sets.

Authors:Shuaibo Li, Zhaohu Xing, Hongqiu Wang, Pengfei Hao, Xingyu Li, Zekai Liu, Lei Zhu
Title: Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method
Abstract:
The rapid advancement of generative AI in medical imaging has introduced both significant opportunities and serious challenges, especially the risk that fake medical images could undermine healthcare systems. These synthetic images pose serious risks, such as diagnostic deception, financial fraud, and misinformation. However, research on medical forensics to counter these threats remains limited, and there is a critical lack of comprehensive datasets specifically tailored for this field. Additionally, existing media forensic methods, which are primarily designed for natural or facial images, are inadequate for capturing the distinct characteristics and subtle artifacts of AI-generated medical images. To tackle these challenges, we introduce \textbf{MedForensics}, a large-scale medical forensics dataset encompassing six medical modalities and twelve state-of-the-art medical generative models. We also propose \textbf{DSKI}, a novel \textbf{D}ual-\textbf{S}tage \textbf{K}nowledge \textbf{I}nfusing detector that constructs a vision-language feature space tailored for the detection of AI-generated medical images. DSKI comprises two core components: 1) a cross-domain fine-trace adapter (CDFA) for extracting subtle forgery clues from both spatial and noise domains during training, and 2) a medical forensic retrieval module (MFRM) that boosts detection accuracy through few-shot retrieval during testing. Experimental results demonstrate that DSKI significantly outperforms both existing methods and human experts, achieving superior accuracy across multiple medical modalities.
中文摘要:针对AI生成医学图像检测领域缺乏专业工具的问题,本文提出MedForensics数据集和DSKI双阶段知识注入检测器,该方案通过跨域微迹适配和医学取证检索模块,在多模态医疗图像检测中显著超越现有方法和人类专家。
English Summary: The MedForensics dataset and DSKI detector are introduced to address the critical lack of specialized forensic tools for identifying AI-generated medical images, significantly outperforming existing methods and human experts across multiple medical modalities.

Authors:Anastasiia Belousova, Francesco Marchiori, Mauro Conti
Title: Inference Attacks on Encrypted Online Voting via Traffic Analysis
Abstract:
Online voting enables individuals to participate in elections remotely, offering greater efficiency and accessibility in both governmental and organizational settings. As this method gains popularity, ensuring the security of online voting systems becomes increasingly vital, as the systems supporting it must satisfy a demanding set of security requirements. Most research in this area emphasizes the design and verification of cryptographic protocols to protect voter integrity and system confidentiality. However, other vectors, such as network traffic analysis, remain relatively understudied, even though they may pose significant threats to voter privacy and the overall trustworthiness of the system. In this paper, we examine how adversaries can exploit metadata from encrypted network traffic to uncover sensitive information during online voting. Our analysis reveals that, even without accessing the encrypted content, it is possible to infer critical voter actions, such as whether a person votes, the exact moment a ballot is submitted, and whether the ballot is valid or spoiled. We test these attacks with both rule-based techniques and machine learning methods. We evaluate our attacks on two widely used online voting platforms, one proprietary and one partially open source, achieving classification accuracy as high as 99.5%. These results expose a significant privacy vulnerability that threatens key properties of secure elections, including voter secrecy and protection against coercion or vote-buying. We explore mitigations to our attacks, demonstrating that countermeasures such as payload padding and timestamp equalization can substantially limit their effectiveness.
在线投票提高了参与度和效率,但存在安全风险,攻击者可通过加密网络流量推断敏感投票行为,威胁隐私和选举公正性,尽管填充载荷和均衡时间戳等对策可有效缓解这些威胁。
Online voting enhances participation and efficiency but faces security risks, as adversaries can infer sensitive voter actions from encrypted network traffic, threatening privacy and election integrity despite potential countermeasures like payload padding and timestamp equalization.

Authors:Weitong Wu, Zhaohu Xing, Jing Gong, Qin Peng, Lei Zhu
Title: HybridMamba: A Dual-domain Mamba for 3D Medical Image Segmentation
Abstract:
In the domain of 3D biomedical image segmentation, Mamba exhibits the superior performance for it addresses the limitations in modeling long-range dependencies inherent to CNNs and mitigates the abundant computational overhead associated with Transformer-based frameworks when processing high-resolution medical volumes. However, attaching undue importance to global context modeling may inadvertently compromise critical local structural information, thus leading to boundary ambiguity and regional distortion in segmentation outputs. Therefore, we propose the HybridMamba, an architecture employing dual complementary mechanisms: 1) a feature scanning strategy that progressively integrates representations both axial-traversal and local-adaptive pathways to harmonize the relationship between local and global representations, and 2) a gated module combining spatial-frequency analysis for comprehensive contextual modeling. Besides, we collect a multi-center CT dataset related to lung cancer. Experiments on MRI and CT datasets demonstrate that HybridMamba significantly outperforms the state-of-the-art methods in 3D medical image segmentation.
Chinese: HybridMamba通过结合局部与全局特征建模的双重机制及空间-频率综合分析,显著提升了3D医学图像分割性能,在多种数据集上超越现有最优方法。
English: HybridMamba enhances 3D biomedical image segmentation by integrating dual mechanisms for balanced local-global feature modeling and comprehensive contextual analysis, achieving superior performance on MRI and CT datasets.

Authors:Shiyuan Luo, Runlong Yu, Chonghao Qiu, Rahul Ghosh, Robert Ladwig, Paul C. Hanson, Yiqun Xie, Xiaowei Jia
Title: Learning to Retrieve for Environmental Knowledge Discovery: An Augmentation-Adaptive Self-Supervised Learning Framework
Abstract:
The discovery of environmental knowledge depends on labeled task-specific data, but is often constrained by the high cost of data collection. Existing machine learning approaches usually struggle to generalize in data-sparse or atypical conditions. To this end, we propose an Augmentation-Adaptive Self-Supervised Learning (A$^2$SL) framework, which retrieves relevant observational samples to enhance modeling of the target ecosystem. Specifically, we introduce a multi-level pairwise learning loss to train a scenario encoder that captures varying degrees of similarity among scenarios. These learned similarities drive a retrieval mechanism that supplements a target scenario with relevant data from different locations or time periods. Furthermore, to better handle variable scenarios, particularly under atypical or extreme conditions where traditional models struggle, we design an augmentation-adaptive mechanism that selectively enhances these scenarios through targeted data augmentation. Using freshwater ecosystems as a case study, we evaluate A$^2$SL in modeling water temperature and dissolved oxygen dynamics in real-world lakes. Experimental results show that A$^2$SL significantly improves predictive accuracy and enhances robustness in data-scarce and atypical scenarios. Although this study focuses on freshwater ecosystems, the A$^2$SL framework offers a broadly applicable solution in various scientific domains.
中文摘要:A$^2$SL框架通过检索相关观测数据并采用自适应增强机制,有效提升了在数据稀缺或非典型条件下的环境建模预测精度与鲁棒性,以淡水生态系统研究为例验证了其广泛适用性。
English Summary: The A$^2$SL framework enhances environmental modeling by retrieving relevant observational data and employing adaptive augmentation to improve prediction accuracy and robustness in data-scarce or atypical conditions, as demonstrated in freshwater ecosystem studies.

Authors:Haoye Tian, Chong Wang, BoYang Yang, Lyuye Zhang, Yang Liu
Title: A Taxonomy of Prompt Defects in LLM Systems
Abstract:
Large Language Models (LLMs) have become key components of modern software, with prompts acting as their de-facto programming interface. However, prompt design remains largely empirical and small mistakes can cascade into unreliable, insecure, or inefficient behavior. This paper presents the first systematic survey and taxonomy of prompt defects, recurring ways that prompts fail to elicit their intended behavior from LLMs. We organize defects along six dimensions: (1) Specification and Intent, (2) Input and Content, (3) Structure and Formatting, (4) Context and Memory, (5) Performance and Efficiency, and (6) Maintainability and Engineering. Each dimension is refined into fine-grained subtypes, illustrated with concrete examples and root cause analysis. Grounded in software engineering principles, we show how these defects surface in real development workflows and examine their downstream effects. For every subtype, we distill mitigation strategies that span emerging prompt engineering patterns, automated guardrails, testing harnesses, and evaluation frameworks. We then summarize these strategies in a master taxonomy that links defect, impact, and remedy. We conclude with open research challenges and a call for rigorous engineering-oriented methodologies to ensure that LLM-driven systems are dependable by design.
中文: 本文首次系统性地调查并分类了大语言模型中提示缺陷的六个维度,通过分析其根本原因并提出缓解策略,旨在运用工程方法提升系统的可靠性。
English: This paper presents the first systematic survey categorizing six dimensions of prompt defects in Large Language Models, analyzing their root causes and proposing mitigation strategies to enhance reliability through engineering methodologies.

Authors:Yulun Wu, Guangba Yu, Zhihan Jiang, Yichen Li, Michael R. Lyu
Title: Trace Sampling 2.0: Code Knowledge Enhanced Span-level Sampling for Distributed Tracing
Abstract:
Distributed tracing is an essential diagnostic tool in microservice systems, but the sheer volume of traces places a significant burden on backend storage. A common approach to mitigating this issue is trace sampling, which selectively retains traces based on specific criteria, often preserving only anomalous ones. However, this method frequently discards valuable information, including normal traces that are essential for comparative analysis. To address this limitation, we introduce Trace Sampling 2.0, which operates at the span level while maintaining trace structure consistency. This approach allows for the retention of all traces while significantly reducing storage overhead. Based on this concept, we design and implement Autoscope, a span-level sampling method that leverages static analysis to extract execution logic, ensuring that critical spans are preserved without compromising structural integrity. We evaluated Autoscope on two open-source microservices. Our results show that it reduces trace size by 81.2% while maintaining 98.1% faulty span coverage, outperforming existing trace-level sampling methods. Furthermore, we demonstrate its effectiveness in root cause analysis, achieving an average improvement of 8.3%. These findings indicate that Autoscope can significantly enhance observability and storage efficiency in microservices, offering a robust solution for performance monitoring.
中文: Autoscope采用跨度级采样方法,在保持98.1%故障跨度覆盖的同时将追踪存储减少81.2%,并将根因分析提升8.3%,显著增强了微服务的可观测性。
English: Autoscope introduces a span-level sampling method that reduces trace storage by 81.2% while preserving 98.1% faulty span coverage and improving root cause analysis by 8.3%, enhancing observability in microservices.

Authors:Kevin Halim, Sin G. Teo, Ruitao Feng, Zhenpeng Chen, Yang Gu, Chong Wang, Yang Liu
Title: A Study on Thinking Patterns of Large Reasoning Models in Code Generation
Abstract:
Currently, many large language models (LLMs) are utilized for software engineering tasks such as code generation. The emergence of more advanced models known as large reasoning models (LRMs), such as OpenAI's o3, DeepSeek R1, and Qwen3. They have demonstrated the capability of performing multi-step reasoning. Despite the advancement in LRMs, little attention has been paid to systematically analyzing the reasoning patterns these models exhibit and how such patterns influence the generated code. This paper presents a comprehensive study aimed at investigating and uncovering the reasoning behavior of LRMs during code generation. We prompted several state-of-the-art LRMs of varying sizes with code generation tasks and applied open coding to manually annotate the reasoning traces. From this analysis, we derive a taxonomy of LRM reasoning behaviors, encompassing 15 reasoning actions across four phases. Our empirical study based on the taxonomy reveals a series of findings. First, we identify common reasoning patterns, showing that LRMs generally follow a human-like coding workflow, with more complex tasks eliciting additional actions such as scaffolding, flaw detection, and style checks. Second, we compare reasoning across models, finding that Qwen3 exhibits iterative reasoning while DeepSeek-R1-7B follows a more linear, waterfall-like approach. Third, we analyze the relationship between reasoning and code correctness, showing that actions such as unit test creation and scaffold generation strongly support functional outcomes, with LRMs adapting strategies based on task context. Finally, we evaluate lightweight prompting strategies informed by these findings, demonstrating the potential of context- and reasoning-oriented prompts to improve LRM-generated code. Our results offer insights and practical implications for advancing automatic code generation.
中文: 本文系统分析了大推理模型在代码生成中的推理行为,归纳出四阶段15种推理动作,并验证了基于推理模式的提示策略对提升代码质量的有效性。
English: This paper conducts a comprehensive analysis of large reasoning models' code generation behaviors, identifying 15 reasoning actions across four phases and demonstrating how targeted prompting strategies can enhance output quality.

Authors:Chuyang Zhou, Ziao Ji, Daochang Liu, Dongang Wang, Chenyu Wang, Chang Xu
Title: Rest2Visual: Predicting Visually Evoked fMRI from Resting-State Scans
Abstract:
Understanding how spontaneous brain activity relates to stimulus-driven neural responses is a fundamental challenge in cognitive neuroscience. While task-based functional magnetic resonance imaging (fMRI) captures localized stimulus-evoked brain activation, its acquisition is costly, time-consuming, and difficult to scale across populations. In contrast, resting-state fMRI (rs-fMRI) is task-free and abundant, but lacks direct interpretability. We introduce Rest2Visual, a conditional generative model that predicts visually evoked fMRI (ve-fMRI) from resting-state input and 2D visual stimuli. It follows a volumetric encoder--decoder design, where multiscale 3D features from rs-fMRI are modulated by image embeddings via adaptive normalization, enabling spatially accurate, stimulus-specific activation synthesis. To enable model training, we construct a large-scale triplet dataset from the Natural Scenes Dataset (NSD), aligning each rs-fMRI volume with stimulus images and their corresponding ve-fMRI activation maps. Quantitative evaluation shows that the predicted activations closely match ground truth across standard similarity and representational metrics, and support successful image reconstruction in downstream decoding. Notably, the predicted maps preserve subject-specific structure, demonstrating the model's capacity to generate individualized functional surrogates. Our results provide compelling evidence that individualized spontaneous neural activity can be transformed into stimulus-aligned representations, opening new avenues for scalable, task-free functional brain modeling.
中文摘要:Rest2Visual是一种生成模型,能够通过静息态功能磁共振成像和视觉刺激精确预测视觉诱发的大脑活动,为无需任务实验即可实现可扩展的个体化脑功能建模开辟了新途径。
English Summary: Rest2Visual is a generative model that accurately predicts visually evoked brain activity from resting-state fMRI and visual stimuli, enabling scalable and individualized functional brain modeling without task-based experiments.

Authors:Kit-Wa Sou, Junhao Gong, Shoujie Li, Chuqiao Lyu, Ziwu Song, Shilong Mu, Wenbo Ding
Title: MoiréTac: A Dual-Mode Visuotactile Sensor for Multidimensional Perception Using Moiré Pattern Amplification
Abstract:
Visuotactile sensors typically employ sparse marker arrays that limit spatial resolution and lack clear analytical force-to-image relationships. To solve this problem, we present \textbf{MoiréTac}, a dual-mode sensor that generates dense interference patterns via overlapping micro-gratings within a transparent architecture. When two gratings overlap with misalignment, they create moiré patterns that amplify microscopic deformations. The design preserves optical clarity for vision tasks while producing continuous moiré fields for tactile sensing, enabling simultaneous 6-axis force/torque measurement, contact localization, and visual perception. We combine physics-based features (brightness, phase gradient, orientation, and period) from moiré patterns with deep spatial features. These are mapped to 6-axis force/torque measurements, enabling interpretable regression through end-to-end learning. Experimental results demonstrate three capabilities: force/torque measurement with R^2 > 0.98 across tested axes; sensitivity tuning through geometric parameters (threefold gain adjustment); and vision functionality for object classification despite moiré overlay. Finally, we integrate the sensor into a robotic arm for cap removal with coordinated force and torque control, validating its potential for dexterous manipulation.
中文: MoiréTac 提出了一种双模式传感器,通过重叠微光栅产生密集的莫尔条纹,实现了高精度的六轴力/力矩测量、接触定位和视觉感知功能,并具备可调节的灵敏度。
English: MoiréTac introduces a dual-mode sensor using overlapping micro-gratings to generate dense moiré patterns, enabling simultaneous 6-axis force/torque measurement, contact localization, and visual perception with high accuracy and tunable sensitivity.

Authors:Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang
Title: The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning
Abstract:
We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.6% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
中文: LightVLA是一种可微分的令牌剪枝框架,通过自适应剪除非关键视觉令牌,在提升任务成功率的同时大幅降低计算开销,从而优化视觉-语言-动作模型的效率与性能。
English: LightVLA is a differentiable token pruning framework that enhances vision-language-action models by adaptively pruning non-essential visual tokens, achieving higher task success rates with significantly reduced computational costs.

Authors:Shuning Zhang, Sixing Tao, Eve He, Yuting Yang, Ying Ma, Ailei Wang, Xin Yi, Hewu Li
Title: Conflect: Designing Reflective Thinking-Based Contextual Privacy Policy for Mobile Applications
Abstract:
Privacy policies are lengthy and complex, leading to user neglect. While contextual privacy policies (CPPs) present information at the point of risk, they may lack engagement and disrupt tasks. We propose Conflect, an interactive CPP for mobile apps, guided by a reflective thinking framework. Through three workshops with experienced designers and researchers, we constructed the design space of reflective thinking-based CPP design, and identified the disconnect between context and action as the most critical problem. Based on participants' feedback, we designed Conflect to use sidebar alerts, allowing users to reflect on contextualized risks and fostering their control. Our system contextually detects privacy risks, extracts policy segments, and automatically generates risk descriptions with 94.0% policy extraction accuracy on CPP4APP dataset and a 4.35s latency. A user study (N=28) demonstrated that Conflect improves user understanding, trust, and satisfaction while lowering cognitive load compared to CPPs, privacy policies and privacy labels.
中文总结:Conflect是一种基于反思思维框架的交互式情境隐私政策系统,通过侧边栏风险提示帮助用户理解情境化隐私风险,在提高用户理解度、信任感和满意度的同时有效降低认知负担。
English Summary: Conflect is an interactive contextual privacy policy system for mobile apps that enhances user understanding, trust, and satisfaction by providing real-time risk alerts and reducing cognitive load through reflective thinking.

Authors:Shuning Zhang, Yutong Jiang, Rongjun Ma, Yuting Yang, Mingyao Xu, Zhixin Huang, Xin Yi, Hewu Li
Title: PrivWeb: Unobtrusive and Content-aware Privacy Protection For Web Agents
Abstract:
While web agents gained popularity by automating web interactions, their requirement for interface access introduces significant privacy risks that are understudied, particularly from users' perspective. Through a formative study (N=15), we found users frequently misunderstand agents' data practices, and desired unobtrusive, transparent data management. To achieve this, we designed and implemented PrivWeb, a trusted add-on on web agents that utilizes a localized LLM to anonymize private information on interfaces according to user preferences. It features privacy categorization schema and adaptive notifications that selectively pauses tasks for user control over information collection for highly sensitive information, while offering non-disruptive options for less sensitive information, minimizing human oversight. The user study (N=14) across travel, information retrieval, shopping, and entertainment tasks compared PrivWeb with baselines without notification and without control for private information access, where PrivWeb reduced perceived privacy risks with no associated increase in cognitive effort, and resulted in higher overall satisfaction.
中文摘要:PrivWeb作为一款浏览器插件,通过本地化LLM根据用户偏好对网页界面上的隐私信息进行匿名化处理,其自适应通知功能在用户研究中被证明能有效降低隐私风险且不增加认知负担。
English Summary: PrivWeb is a browser add-on that uses a localized LLM to anonymize private information on web interfaces according to user preferences, featuring adaptive notifications that reduce privacy risks without increasing cognitive effort, as demonstrated in user studies.

Authors:Hongyuan Zhang, Yuheng Wu, Mingyang Zhao, Zhiwei Chen, Rebecca Li, Fei Zhu, Haohan Zhao, Xiaohua Yuan, Meng Yang, Chunli Qiu, Xiang Cong, Haiyan Chen, Lina Luan, Randolph H. L. Wong, Huai Liao, Colin A Graham, Shi Chang, Guowei Tao, Dong Yi, Zhen Lei, Nassir Navab, Sebastien Ourselin, Jiebo Luo, Hongbin Liu, Gaofeng Meng
Title: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications
Abstract:
Artificial intelligence (AI) that can effectively learn ultrasound representations by integrating multi-source data holds significant promise for advancing clinical care. However, the scarcity of large labeled datasets in real-world clinical environments and the limited generalizability of task-specific models have hindered the development of generalizable clinical AI models for ultrasound applications. In this study, we present EchoCare, a novel ultrasound foundation model for generalist clinical use, developed via self-supervised learning on our curated, publicly available, large-scale dataset EchoCareData. EchoCareData comprises 4.5 million ultrasound images, sourced from over 23 countries across 5 continents and acquired via a diverse range of distinct imaging devices, thus encompassing global cohorts that are multi-center, multi-device, and multi-ethnic. Unlike prior studies that adopt off-the-shelf vision foundation model architectures, we introduce a hierarchical classifier into EchoCare to enable joint learning of pixel-level and representation-level features, capturing both global anatomical contexts and local ultrasound characteristics. With minimal training, EchoCare outperforms state-of-the-art comparison models across 10 representative ultrasound benchmarks of varying diagnostic difficulties, spanning disease diagnosis, lesion segmentation, organ detection, landmark prediction, quantitative regression, imaging enhancement and report generation. The code and pretrained model are publicly released, rendering EchoCare accessible for fine-tuning and local adaptation, supporting extensibility to additional applications. EchoCare provides a fully open and generalizable foundation model to boost the development of AI technologies for diverse clinical ultrasound applications.
中文: EchoCare是一种创新的超声基础模型,通过自监督学习利用全球多样化数据集开发,在多种临床任务中表现卓越,为推进超声人工智能技术提供了一个开放且可扩展的平台。
English: EchoCare is a groundbreaking ultrasound foundation model that uses self-supervised learning on a diverse global dataset to outperform existing models across multiple clinical tasks, offering an open and adaptable tool for advancing AI in ultrasound applications.

Authors:Tasnuva Chowdhury, Tadashi Maeno, Fatih Furkan Akman, Joseph Boudreau, Sankha Dutta, Shengyu Feng, Adolfy Hoisie, Kuan-Chieh Hsu, Raees Khan, Jaehyung Kim, Ozgur O. Kilic, Scott Klasky, Alexei Klimentov, Tatiana Korchuganova, Verena Ingrid Martinez Outschoorn, Paul Nilsson, David K. Park, Norbert Podhorszki, Yihui Ren, John Rembrandt Steele, Frédéric Suter, Sairam Sri Vatsavai, Torre Wenaus, Wei Yang, Yiming Yang, Shinjae Yoo
Title: Machine Learning-Driven Predictive Resource Management in Complex Science Workflows
Abstract:
The collaborative efforts of large communities in science experiments, often comprising thousands of global members, reflect a monumental commitment to exploration and discovery. Recently, advanced and complex data processing has gained increasing importance in science experiments. Data processing workflows typically consist of multiple intricate steps, and the precise specification of resource requirements is crucial for each step to allocate optimal resources for effective processing. Estimating resource requirements in advance is challenging due to a wide range of analysis scenarios, varying skill levels among community members, and the continuously increasing spectrum of computing options. One practical approach to mitigate these challenges involves initially processing a subset of each step to measure precise resource utilization from actual processing profiles before completing the entire step. While this two-staged approach enables processing on optimal resources for most of the workflow, it has drawbacks such as initial inaccuracies leading to potential failures and suboptimal resource usage, along with overhead from waiting for initial processing completion, which is critical for fast-turnaround analyses. In this context, our study introduces a novel pipeline of machine learning models within a comprehensive workflow management system, the Production and Distributed Analysis (PanDA) system. These models employ advanced machine learning techniques to predict key resource requirements, overcoming challenges posed by limited upfront knowledge of characteristics at each step. Accurate forecasts of resource requirements enable informed and proactive decision-making in workflow management, enhancing the efficiency of handling diverse, complex workflows across heterogeneous resources.
中文摘要:本研究在PanDA工作流管理系统中引入了一种机器学习模型管道,旨在预测资源需求,以解决复杂科学数据处理中的挑战,并提高跨异构资源处理多样化工作流的效率。
English Summary: This study introduces a machine learning pipeline within the PanDA workflow management system to predict resource requirements, overcoming challenges in complex scientific data processing and enhancing efficiency across diverse workflows.

Authors:Shuning Zhang, Linzhi Wang, Dai Shi, Yuwei Chuai, Jingruo Chen, Yunyi Chen, Yifan Wang, Yating Wang, Xin Yi, Hewu Li
Title: Commenotes: Synthesizing Organic Comments to Support Community-Based Fact-Checking
Abstract:
Community-based fact-checking is promising to reduce the spread of misleading posts at scale. However, its effectiveness can be undermined by the delays in fact-check delivery. Notably, user-initiated organic comments often contain debunking information and have the potential to help mitigate this limitation. Here, we investigate the feasibility of synthesizing comments to generate timely high-quality fact-checks. To this end, we analyze over 2.2 million replies on X and introduce Commenotes, a two-phase framework that filters and synthesizes comments to facilitate fact-check delivery. Our framework reveals that fact-checking comments appear early and sufficiently: 99.3\% of misleading posts receive debunking comments within the initial two hours since post publication, with synthesized \textit{commenotes} successfully earning user trust for 85.8\% of those posts. Additionally, a user study (N=144) found that the synthesized commenotes were often preferred, with the best-performing model achieving a 70.1\% win rate over human notes and being rated as significantly more helpful.
中文: 社区事实核查可通过合成用户有机评论中的辟谣信息来提升时效性,Commenotes框架显示99.3%的误导性帖子在两小时内获得辟谣评论,且合成笔记因更具帮助性而常被用户青睐。
English: Community-based fact-checking can be enhanced by synthesizing timely debunking comments from organic user replies, with the Commenotes framework demonstrating that 99.3% of misleading posts receive such comments within two hours and synthesized notes are often preferred for their helpfulness.

Authors:Tong Zhou, Ruyi Ding, Gaowen Liu, Charles Fleming, Ramana Rao Kompella, Yunsi Fei, Xiaolin Xu, Shaolei Ren
Title: A Content-dependent Watermark for Safeguarding Image Attribution
Abstract:
The rapid growth of digital and AI-generated images has amplified the need for secure and verifiable methods of image attribution. While digital watermarking offers more robust protection than metadata-based approaches--which can be easily stripped--current watermarking techniques remain vulnerable to forgery, creating risks of misattribution that can damage the reputations of AI model developers and the rights of digital artists. These vulnerabilities arise from two key issues: (1) content-agnostic watermarks, which, once learned or leaked, can be transferred across images to fake attribution, and (2) reliance on detector-based verification, which is unreliable since detectors can be tricked. We present MetaSeal, a novel framework for content-dependent watermarking with cryptographic security guarantees to safeguard image attribution. Our design provides (1) forgery resistance, preventing unauthorized replication and enforcing cryptographic verification; (2) robust, self-contained protection, embedding attribution directly into images while maintaining resilience against benign transformations; and (3) evidence of tampering, making malicious alterations visually detectable. Experiments demonstrate that MetaSeal effectively mitigates forgery attempts and applies to both natural and AI-generated images, establishing a new standard for secure image attribution.
中文: MetaSeal提出了一种内容依赖的水印框架,通过密码学安全机制防止图像归属伪造,为自然图像和AI生成图像提供强保护及篡改证据。
English: MetaSeal introduces a content-dependent watermarking framework with cryptographic security to combat forgery in image attribution, offering robust protection and tamper evidence for both natural and AI-generated images.

Authors:Yuwen Cao, Guijun Liu, Tomoaki Ohtsuki, Howard H. Yang, Tony Q. S. Quek
Title: Distributed Gossip-GAN for Low-overhead CSI Feedback Training in FDD mMIMO-OFDM Systems
Abstract:
The deep autoencoder (DAE) framework has turned out to be efficient in reducing the channel state information (CSI) feedback overhead in massive multiple-input multipleoutput (mMIMO) systems. However, these DAE approaches presented in prior works rely heavily on large-scale data collected through the base station (BS) for model training, thus rendering excessive bandwidth usage and data privacy issues, particularly for mMIMO systems. When considering users' mobility and encountering new channel environments, the existing CSI feedback models may often need to be retrained. Returning back to previous environments, however, will make these models perform poorly and face the risk of catastrophic forgetting. To solve the above challenging problems, we propose a novel gossiping generative adversarial network (Gossip-GAN)-aided CSI feedback training framework. Notably, Gossip-GAN enables the CSI feedback training with low-overhead while preserving users' privacy. Specially, each user collects a small amount of data to train a GAN model. Meanwhile, a fully distributed gossip-learning strategy is exploited to avoid model overfitting, and to accelerate the model training as well. Simulation results demonstrate that Gossip-GAN can i) achieve a similar CSI feedback accuracy as centralized training with real-world datasets, ii) address catastrophic forgetting challenges in mobile scenarios, and iii) greatly reduce the uplink bandwidth usage. Besides, our results show that the proposed approach possesses an inherent robustness.
中文: 提出的Gossip-GAN框架通过分布式流言学习机制,在保护用户隐私的前提下仅需少量数据即可实现高效CSI反馈训练,其性能媲美集中式训练,并能解决灾难性遗忘问题同时显著降低带宽消耗。
English: The proposed Gossip-GAN framework enables efficient and privacy-preserving CSI feedback training in mMIMO systems by using distributed gossip-learning with minimal data, achieving accuracy comparable to centralized methods while mitigating catastrophic forgetting and reducing bandwidth usage.

Authors:Yuwei Chuai, Shuning Zhang, Ziming Wang, Xin Yi, Mohsen Mosleh, Gabriele Lenzini
Title: Request a Note: How the Request Function Shapes X's Community Notes System
Abstract:
X's Community Notes is a crowdsourced fact-checking system. To improve its scalability, X recently introduced "Request Community Note" feature, enabling users to solicit fact-checks from contributors on specific posts. Yet, its implications for the system -- what gets checked, by whom, and with what quality -- remain unclear. Using 98,685 requested posts and their associated notes, we evaluate how requests shape the Community Notes system. We find that contributors prioritize posts with higher misleadingness and from authors with greater misinformation exposure, but neglect political content emphasized by requestors. Selection also diverges along partisan lines: contributors more often annotate posts from Republicans, while requestors surface more from Democrats. Although only 12% of posts receive request-fostered notes from top contributors, these notes are rated as more helpful and less polarized than others, partly reflecting top contributors' selective fact-checking of misleading posts. Our findings highlight both the limitations and promise of requests for scaling high-quality community-based fact-checking.
中文摘要:研究发现,“请求社区注释”功能虽能优先核查误导性内容和惯常传播不实信息的作者,但存在党派选择偏差并忽略政治类请求,不过顶级贡献者最终仍能借此生成质量更高、更少两极分化的注释。
English Summary: The study finds that while the "Request Community Note" feature helps prioritize fact-checking of misleading content from prolific misinformation spreaders, it introduces partisan selection biases and neglects politically charged requests, though it ultimately produces higher-quality and less polarized notes from top contributors.

Authors:Zhiyu Fan, Kirill Vasilevski, Dayi Lin, Boyuan Chen, Yihao Chen, Zhiqing Zhong, Jie M. Zhang, Pinjia He, Ahmed E. Hassan
Title: SWE-Effi: Re-Evaluating Software AI Agent System Effectiveness Under Resource Constraints
Abstract:
The advancement of large language models (LLMs) and code agents has demonstrated significant potential to assist software engineering (SWE) tasks, such as autonomous issue resolution and feature addition. Existing AI for software engineering leaderboards (e.g., SWE-bench) focus solely on solution accuracy, ignoring the crucial factor of effectiveness in a resource-constrained world. This is a universal problem that also exists beyond software engineering tasks: any AI system should be more than correct - it must also be cost-effective. To address this gap, we introduce SWE-Effi, a set of new metrics to re-evaluate AI systems in terms of holistic effectiveness scores. We define effectiveness as the balance between the accuracy of outcome (e.g., issue resolve rate) and the resources consumed (e.g., token and time). In this paper, we specifically focus on the software engineering scenario by re-ranking popular AI systems for issue resolution on a subset of the SWE-bench benchmark using our new multi-dimensional metrics. We found that AI system's effectiveness depends not just on the scaffold itself, but on how well it integrates with the base model, which is key to achieving strong performance in a resource-efficient manner. We also identified systematic challenges such as the "token snowball" effect and, more significantly, a pattern of "expensive failures". In these cases, agents consume excessive resources while stuck on unsolvable tasks - an issue that not only limits practical deployment but also drives up the cost of failed rollouts during RL training. Lastly, we observed a clear trade-off between effectiveness under the token budget and effectiveness under the time budget, which plays a crucial role in managing project budgets and enabling scalable reinforcement learning, where fast responses are essential.
Chinese: 本文提出SWE-Effi指标集,通过平衡准确性与资源效率来评估软件工程中的AI系统,弥补了现有基准仅关注解决方案准确性而忽略成本效益的不足。
English: The paper introduces SWE-Effi, a set of metrics to evaluate AI systems in software engineering by balancing accuracy with resource efficiency, addressing limitations of current benchmarks that overlook cost-effectiveness.

Authors:Duolin Sun, Dan Yang, Yue Shen, Yihan Jiao, Zhehao Tan, Jie Feng, Lianzhen Zhong, Jian Wang, Peng Wei, Jinjie Gu
Title: HANRAG: Heuristic Accurate Noise-resistant Retrieval-Augmented Generation for Multi-hop Question Answering
Abstract:
The Retrieval-Augmented Generation (RAG) approach enhances question-answering systems and dialogue generation tasks by integrating information retrieval (IR) technologies with large language models (LLMs). This strategy, which retrieves information from external knowledge bases to bolster the response capabilities of generative models, has achieved certain successes. However, current RAG methods still face numerous challenges when dealing with multi-hop queries. For instance, some approaches overly rely on iterative retrieval, wasting too many retrieval steps on compound queries. Additionally, using the original complex query for retrieval may fail to capture content relevant to specific sub-queries, resulting in noisy retrieved content. If the noise is not managed, it can lead to the problem of noise accumulation. To address these issues, we introduce HANRAG, a novel heuristic-based framework designed to efficiently tackle problems of varying complexity. Driven by a powerful revelator, HANRAG routes queries, decomposes them into sub-queries, and filters noise from retrieved documents. This enhances the system's adaptability and noise resistance, making it highly capable of handling diverse queries. We compare the proposed framework against other leading industry methods across various benchmarks. The results demonstrate that our framework obtains superior performance in both single-hop and multi-hop question-answering tasks.
中文: HANRAG框架通过高效路由和分解查询并过滤噪声,改进了检索增强生成方法,在单跳和多跳问答任务中均取得了优越性能。
English: The HANRAG framework improves Retrieval-Augmented Generation by efficiently routing and decomposing queries while filtering noise, achieving superior performance in both single-hop and multi-hop question-answering tasks.

Authors:Hengyu Fang, Yijiang Liu, Yuan Du, Li Du, Huanrui Yang
Title: SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
Abstract:
Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a $\times$1.93 speedup and up to a 4.5\% average success rate enhancement compared to the original model.
Chinese: SQAP-VLA是一种无需训练的结构化框架,首次在视觉-语言-动作模型中同时实现最先进的量化和令牌剪枝,通过协同设计克服了二者的不兼容性,获得1.93倍加速和最高4.5%的性能提升。
English: SQAP-VLA is a training-free framework that simultaneously applies advanced quantization and token pruning to Vision-Language-Action models, achieving a 1.93× speedup and up to 4.5% performance improvement while overcoming their previous incompatibility.

Authors:Ya-Ting Yang, Quanyan Zhu
Title: Toward a Multi-Echelon Cyber Warfare Theory: A Meta-Game-Theoretic Paradigm for Defense and Dominance
Abstract:
Cyber warfare has become a central element of modern conflict, especially within multi-domain operations. As both a distinct and critical domain, cyber warfare requires integrating defensive and offensive technologies into coherent strategies. While prior research has emphasized isolated tactics or fragmented technologies, a holistic understanding is essential for effective resource deployment and risk mitigation. Game theory offers a unifying framework for this purpose. It not only models attacker-defender interactions but also provides quantitative tools for equilibrium analysis, risk assessment, and strategic reasoning. Integrated with modern AI techniques, game-theoretic models enable the design and optimization of strategies across multiple levels of cyber warfare, from policy and strategy to operations, tactics, and technical implementations. These models capture the paradoxical logic of conflict, where more resources do not always translate into greater advantage, and where nonlinear dynamics govern outcomes. To illustrate the approach, this chapter examines RedCyber, a synthetic cyber conflict, demonstrating how game-theoretic methods capture the interdependencies of cyber operations. The chapter concludes with directions for future research on resilience, cros-echelon planning, and the evolving role of AI in cyber warfare.
中文: 博弈论为网络战提供了统一框架,结合人工智能优化各层级战略,并通过模拟冲突场景验证其应用价值。
English: Game theory provides a unified framework for modeling cyber warfare, integrating AI to optimize strategies across all operational levels and demonstrating its application through a synthetic conflict scenario.

Authors:Lynnette Hui Xian Ng, Bianca N. Y. Kang, Kathleen M. Carley
Title: AuraSight: Generating Realistic Social Media Data
Abstract:
This document details the narrative and technical design behind the process of generating a quasi-realistic set X data for a fictional multi-day pop culture episode (AuraSight). Social media post simulation is essential towards creating realistic training scenarios for understanding emergent network behavior that formed from known sets of agents. Our social media post generation pipeline uses the AESOP-SynSM engine, which employs a hybrid approach of agent-based and generative artificial intelligence techniques. We explicate choices in scenario setup and summarize the fictional groups involved, before moving on to the operationalization of these actors and their interactions within the SynSM engine. We also briefly illustrate some outputs generated and discuss the utility of such simulated data and potential future improvements.
中文: 本文介绍了利用混合人工智能方法为虚构的AuraSight活动生成准真实数据的流程,通过模拟社交媒体帖子来研究网络涌现行为并优化训练场景。
English: This document outlines the creation of quasi-realistic data for the fictional AuraSight event using a hybrid AI approach to simulate social media posts, aiming to study emergent network behaviors and improve training scenarios.

Authors:Garry Yang, Zizhe Chen, Man Hon Wong, Haoyu Lei, Yongqiang Chen, Zhenguo Li, Kaiwen Zhou, James Cheng
Title: MESH -- Understanding Videos Like Human: Measuring Hallucinations in Large Video Models
Abstract:
Large Video Models (LVMs) build on the semantic capabilities of Large Language Models (LLMs) and vision modules by integrating temporal information to better understand dynamic video content. Despite their progress, LVMs are prone to hallucinations-producing inaccurate or irrelevant descriptions. Current benchmarks for video hallucination depend heavily on manual categorization of video content, neglecting the perception-based processes through which humans naturally interpret videos. We introduce MESH, a benchmark designed to evaluate hallucinations in LVMs systematically. MESH uses a Question-Answering framework with binary and multi-choice formats incorporating target and trap instances. It follows a bottom-up approach, evaluating basic objects, coarse-to-fine subject features, and subject-action pairs, aligning with human video understanding. We demonstrate that MESH offers an effective and comprehensive approach for identifying hallucinations in videos. Our evaluations show that while LVMs excel at recognizing basic objects and features, their susceptibility to hallucinations increases markedly when handling fine details or aligning multiple actions involving various subjects in longer videos.
中文: 大型视频模型容易产生幻觉,尤其在处理长视频中的细节或多重动作时,而MESH基准通过自下而上、符合人类认知的方法系统地评估这些问题。
English: Large Video Models (LVMs) are prone to hallucinations, especially with fine details or multiple actions in longer videos, and the MESH benchmark systematically evaluates these issues using a bottom-up, human-aligned approach.

Authors:Shahid Shafi Dar, Bharat Kaurav, Arnav Jain, Chandravardhan Singh Raghaw, Mohammad Zia Ur Rehman, Nagendra Kumar
Title: An Explainable Deep Neural Network with Frequency-Aware Channel and Spatial Refinement for Flood Prediction in Sustainable Cities
Abstract:
In an era of escalating climate change, urban flooding has emerged as a critical challenge for sustainable cities, threatening lives, infrastructure, and ecosystems. Traditional flood detection methods are constrained by their reliance on unimodal data and static rule-based systems, which fail to capture the dynamic, non-linear relationships inherent in flood events. Furthermore, existing attention mechanisms and ensemble learning approaches exhibit limitations in hierarchical refinement, cross-modal feature integration, and adaptability to noisy or unstructured environments, resulting in suboptimal flood classification performance. To address these challenges, we present XFloodNet, a novel framework that redefines urban flood classification through advanced deep-learning techniques. XFloodNet integrates three novel components: (1) a Hierarchical Cross-Modal Gated Attention mechanism that dynamically aligns visual and textual features, enabling precise multi-granularity interactions and resolving contextual ambiguities; (2) a Heterogeneous Convolutional Adaptive Multi-Scale Attention module, which leverages frequency-enhanced channel attention and frequency-modulated spatial attention to extract and prioritize discriminative flood-related features across spectral and spatial domains; and (3) a Cascading Convolutional Transformer Feature Refinement technique that harmonizes hierarchical features through adaptive scaling and cascading operations, ensuring robust and noise-resistant flood detection. We evaluate our proposed method on three benchmark datasets, such as Chennai Floods, Rhine18 Floods, and Harz17 Floods, XFloodNet achieves state-of-the-art F1-scores of 93.33%, 82.24%, and 88.60%, respectively, surpassing existing methods by significant margins.
中文: XFloodNet提出了一种创新的深度学习框架,通过整合分层跨模态注意力、多尺度特征提取和级联变换器技术,在城市洪水分类中实现了突破性性能,在基准数据集上达到最优水平。
English: XFloodNet introduces a novel deep-learning framework that overcomes limitations in urban flood classification by integrating hierarchical cross-modal attention, multi-scale feature extraction, and cascading transformers, achieving state-of-the-art performance on benchmark datasets.

Authors:Yi Xie, Ziyuan Yang, Yongqiang Huang, Yinyu Chen, Lei Zhang, Liang Liu, Yi Zhang
Title: Uncertainty-Driven Hierarchical Sampling for Unbalanced Continual Malware Detection with Time-Series Update-Based Retrieval
Abstract:
Android malware detection continues to face persistent challenges stemming from long-term concept drift and class imbalance, as evolving malicious behaviors and shifting usage patterns dynamically reshape feature distributions. Although continual learning (CL) mitigates drift, existing replay-based methods suffer from inherent bias. Specifically, their reliance on classifier uncertainty for sample selection disproportionately prioritizes the dominant benign class, causing overfitting and reduced generalization to evolving malware. To address these limitations, we propose a novel uncertainty-guided CL framework. First, we introduce a hierarchical balanced sampler that employs a dual-phase uncertainty strategy to dynamically balance benign and malicious samples while simultaneously selecting high-information, high-uncertainty instances within each class. This mechanism ensures class equilibrium across both replay and incremental data, thereby enhancing adaptability to emerging threats. Second, we augment the framework with a vector retrieval mechanism that exploits historical malware embeddings to identify evolved variants via similarity-based retrieval, thereby complementing classifier updates. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods under strict low-label conditions (50 labels per phase). It achieves a true positive rate (TPR) of 92.95\% and a mean accuracy (mACC) of 94.26\%, which validates its efficacy for sustainable Android malware detection.
Chinese: 本研究提出的不确定性引导持续学习框架通过分层平衡采样器和向量检索机制,有效应对安卓恶意软件检测中的概念漂移和类别不平衡问题,在低标签条件下实现了92.95%的检出率和94.26%的平均准确率。
English: The proposed uncertainty-guided continual learning framework addresses concept drift and class imbalance in Android malware detection by employing a hierarchical balanced sampler and vector retrieval mechanism, achieving superior performance with a 92.95% TPR and 94.26% mACC under low-label conditions.

Authors:Zheng Dou, Deqing Wang, Fuzhen Zhuang, Jian Ren, Yanlin Hu
Title: FLeW: Facet-Level and Adaptive Weighted Representation Learning of Scientific Documents
Abstract:
Scientific document representation learning provides powerful embeddings for various tasks, while current methods face challenges across three approaches. 1) Contrastive training with citation-structural signals underutilizes citation information and still generates single-vector representations. 2) Fine-grained representation learning, which generates multiple vectors at the sentence or aspect level, requires costly integration and lacks domain generalization. 3) Task-aware learning depends on manually predefined task categorization, overlooking nuanced task distinctions and requiring extra training data for task-specific modules. To address these problems, we propose a new method that unifies the three approaches for better representations, namely FLeW. Specifically, we introduce a novel triplet sampling method that leverages citation intent and frequency to enhance citation-structural signals for training. Citation intents (background, method, result), aligned with the general structure of scientific writing, facilitate a domain-generalized facet partition for fine-grained representation learning. Then, we adopt a simple weight search to adaptively integrate three facet-level embeddings into a task-specific document embedding without task-aware fine-tuning. Experiments show the applicability and robustness of FLeW across multiple scientific tasks and fields, compared to prior models.
中文: 现有科学文档表征方法面临引用信息利用不足、细粒度方法缺乏领域泛化以及需预设任务分类的挑战,而提出的FLeW方法通过增强引用信号和自适应嵌入集成统一了这些方法,在跨任务和领域应用中展现出优越的适用性与鲁棒性。
English: Current scientific document representation methods face challenges in underutilizing citation information, lacking domain generalization in fine-grained approaches, and requiring predefined task categorization, which the proposed FLeW method addresses by unifying these approaches with enhanced citation signals and adaptive embedding integration for robust performance across tasks and fields.

Authors:Chuang Niu, Ge Wang
Title: Reasoning Language Model for Personalized Lung Cancer Screening
Abstract:
Accurate risk assessment in lung cancer screening is critical for enabling early cancer detection and minimizing unnecessary invasive procedures. The Lung CT Screening Reporting and Data System (Lung-RADS) has been widely used as the standard framework for patient management and follow-up. Nevertheless, Lung-RADS faces trade-offs between sensitivity and specificity, as it stratifies risk solely based on lung nodule characteristics without incorporating various risk factors. Here we propose a reasoning language model (RLM) to integrate radiology findings with longitudinal medical records for individualized lung cancer risk assessment. Through a systematic study including dataset construction and distillation, supervised fine-tuning, reinforcement learning, and comprehensive evaluation, our model makes significant improvements in risk prediction performance on datasets in the national lung screening trial. Notably, RLM can decompose the risk evaluation task into sub-components, analyze the contributions of diverse risk factors, and synthesize them into a final risk score computed using our data-driven system equation. Our approach improves both predictive accuracy and monitorability through the chain of thought reasoning process, thereby facilitating clinical translation into lung cancer screening.
中文: 本研究提出了一种推理语言模型,通过整合放射学发现与纵向医疗记录,以数据驱动的方式提升肺癌风险评估的预测准确性和临床适用性。
English: The study introduces a reasoning language model that enhances lung cancer risk assessment by integrating radiology findings with longitudinal medical records, improving predictive accuracy and clinical applicability through a systematic, data-driven approach.

Authors:Hanna Foerster, Ilia Shumailov, Yiren Zhao, Harsh Chaudhari, Jamie Hayes, Robert Mullins, Yarin Gal
Title: Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated
Abstract:
Early research into data poisoning attacks against Large Language Models (LLMs) demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce ``decomposed reasoning poison'', in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.
中文摘要:早期研究表明,大型语言模型容易遭受后门攻击,但近期模型引入逐步推理机制后,攻击面扩展至中间思维链,尽管分解式推理毒药攻击可修改推理路径,但可靠激活以改变最终答案却异常困难,这源于模型从思维过程中恢复的能力。
English Summary: Early research showed that backdoors could be easily injected into LLMs, but recent models with step-by-step reasoning introduce new attack vectors, though reliably activating these decomposed poisons to alter final answers is surprisingly difficult due to the models' ability to recover from such attacks.

Authors:Ya-Ting Yang, Quanyan Zhu
Title: Bi-Level Game-Theoretic Planning of Cyber Deception for Cognitive Arbitrage
Abstract:
Cognitive vulnerabilities shape human decision-making and arise primarily from two sources: (1) cognitive capabilities, which include disparities in knowledge, education, expertise, or access to information, and (2) cognitive biases, such as rational inattention, confirmation bias, and base rate neglect, which influence how individuals perceive and process information. Exploiting these vulnerabilities allows an entity with superior cognitive awareness to gain a strategic advantage, a concept referred to as cognitive arbitrage. This paper investigates how to exploit the cognitive vulnerabilities of Advanced Persistent Threat (APT) attackers and proposes cognition-aware defenses that leverage windows of superiority to counteract attacks. Specifically, the proposed bi-level cyber warfare game focuses on "strategic-level" design for defensive deception mechanisms, which then facilitates "operational-level" actions and tactical-level execution of Tactics, Techniques, and Procedures (TTPs). Game-theoretic reasoning and analysis play a significant role in the cross-echelon quantitative modeling and design of cognitive arbitrage strategies. Our numerical results demonstrate that although the defender's initial advantage diminishes over time, strategically timed and deployed deception techniques can turn a negative value for the attacker into a positive one during the planning phase, and achieve at least a 40% improvement in total rewards during execution. This demonstrates that the defender can amplify even small initial advantages, sustain a strategic edge over the attacker, and secure long-term objectives, such as protecting critical assets throughout the attacker's lifecycle.
中文: 本文通过博弈论建模和战略性欺骗手段,研究如何利用高级持续性威胁攻击者的认知脆弱性,证明防御者即使在初始优势减弱时仍能维持战略优势,并实现至少40%的收益提升。
English: This paper explores exploiting cognitive vulnerabilities in APT attackers through game-theoretic modeling and strategic deception, demonstrating that defenders can sustain advantages and improve rewards by at least 40% despite diminishing initial superiority.

Authors:Trixia Simangan, Ahmed Nadeem Abbasi, Yipeng Hu, Shaheer U. Saeed
Title: Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning
Abstract:
Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
中文摘要:Cryo-RL作为强化学习框架,可自动规划前列腺癌冷冻消融术的冷冻针布局,在临床评估中不仅达到专家水平效果,还大幅缩短了术前规划时间。
English Summary: Cryo-RL is a reinforcement learning framework that automates optimal cryoprobe placement for prostate cancer cryoablation, achieving expert-level performance with significantly reduced planning time in clinical evaluations.

Authors:Eli Borodach, Raj Dandekar, Rajat Dandekar, Sreedath Panat
Title: Decoders Laugh as Loud as Encoders
Abstract:
From the dawn of the computer, Allen Turing dreamed of a robot that could communicate using language as a human being. The recent advances in the field of Large Language Models (LLMs) shocked the scientific community when a single model can apply for various natural language processing (NLP) tasks, while the output results are sometimes even better than most human communication skills. Models such as GPT, Claude, Grok, etc. have left their mark on the scientific community. However, it is unclear how much these models understand what they produce, especially in a nuanced theme such as humor. The question of whether computers understand humor is still open (among the decoders, the latest to be checked was GPT-2). We addressed this issue in this paper; we have showed that a fine-tuned decoder (GPT-4o) performed (Mean F1-macro score of 0.85) as well as the best fine-tuned encoder (RoBERTa with a Mean of F1-score 0.86)
Chinese: 尽管GPT-4o等大型语言模型在幽默理解等任务上表现出与精调编码器相当的性能,但这些模型是否真正理解其输出内容的本质问题——尤其是在微妙主题方面——仍然悬而未决。
English: Recent advances in large language models like GPT-4o have demonstrated performance comparable to fine-tuned encoders in tasks such as humor understanding, yet the fundamental question of whether these models genuinely comprehend nuanced themes remains unresolved.

Authors:Ruohong Yang, Peng Hu, Yunfan Li, Xi Peng
Title: DUDE: Diffusion-Based Unsupervised Cross-Domain Image Retrieval
Abstract:
Unsupervised cross-domain image retrieval (UCIR) aims to retrieve images of the same category across diverse domains without relying on annotations. Existing UCIR methods, which align cross-domain features for the entire image, often struggle with the domain gap, as the object features critical for retrieval are frequently entangled with domain-specific styles. To address this challenge, we propose DUDE, a novel UCIR method building upon feature disentanglement. In brief, DUDE leverages a text-to-image generative model to disentangle object features from domain-specific styles, thus facilitating semantical image retrieval. To further achieve reliable alignment of the disentangled object features, DUDE aligns mutual neighbors from within domains to across domains in a progressive manner. Extensive experiments demonstrate that DUDE achieves state-of-the-art performance across three benchmark datasets over 13 domains. The code will be released.
Chinese: DUDE提出了一种新颖的无监督跨域图像检索方法,通过文本到图像生成模型将对象特征与域特定风格解耦,并以渐进方式跨域对齐,在多个数据集上实现了最先进的性能。
English: DUDE introduces a novel unsupervised cross-domain image retrieval method that disentangles object features from domain-specific styles using a text-to-image generative model and aligns them progressively across domains, achieving state-of-the-art results on multiple datasets.

Authors:Bingxin Zhang, Han Zhang, Kun Yang, Yizhe Zhao, Kezhi Wang
Title: On the Performance Analysis of Pinching-Antenna-Enabled SWIPT Systems
Abstract:
In this paper, we studies the performance of a novel simultaneous wireless information and power transfer (SWIPT) system enabled by a flexible pinching-antenna. To support flexible deployment and optimize energy-rate performance, we propose three practical pinching antenna placement-schemes: the edge deployment scheme (EDS), the center deployment scheme (CDS), and the diagonal deployment scheme (DDS). Moreover, a hybrid time-switching (TS) and power-splitting (PS) protocol is introduced, allowing dynamic adjustment between energy harvesting and information decoding. Under each deployment strategy and the transmission protocol, closed-form expressions for the average harvested energy and average achievable rate of a randomly located user equipment (UE) are derived based on the optimal positioning of the pinching-antenna. Numerical simulations confirm the accuracy of the theoretical analysis and illustrate the trade-off between rate and energy harvesting under different schemes.
中文: 本文研究了一种采用柔性夹持天线的新型SWIPT系统,提出了三种部署方案和混合时间切换-功率分配协议以优化能量速率性能,同时推导了能量收集和可达速率的闭式表达式。
English: This paper investigates a novel SWIPT system using a flexible pinching-antenna, proposing three deployment schemes and a hybrid TS-PS protocol to optimize energy-rate performance while deriving closed-form expressions for harvested energy and achievable rates.

Authors:Ruofan Liang, Kai He, Zan Gojcic, Igor Gilitschenski, Sanja Fidler, Nandita Vijaykumar, Zian Wang
Title: LuxDiT: Lighting Estimation with Video Diffusion Transformer
Abstract:
Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.
中文:LuxDiT是一种新颖的数据驱动方法,通过微调视频扩散变换器,根据视觉输入生成HDR环境光贴图,能有效从间接视觉线索推断光照,并在准确性和细节上超越现有技术。
English: LuxDiT is a novel data-driven method that fine-tunes a video diffusion transformer to generate HDR environment maps from visual inputs, effectively learning to infer lighting from indirect cues and outperforming existing techniques in accuracy and detail.

Authors:Zhengjia Wang, Qiang Sheng, Danding Wang, Beizhe Hu, Juan Cao
Title: Bridging Thoughts and Words: Graph-Based Intent-Semantic Joint Learning for Fake News Detection
Abstract:
Fake news detection is an important and challenging task for defending online information integrity. Existing state-of-the-art approaches typically extract news semantic clues, such as writing patterns that include emotional words, stylistic features, etc. However, detectors tuned solely to such semantic clues can easily fall into surface detection patterns, which can shift rapidly in dynamic environments, leading to limited performance in the evolving news landscape. To address this issue, this paper investigates a novel perspective by incorporating news intent into fake news detection, bridging intents and semantics together. The core insight is that by considering news intents, one can deeply understand the inherent thoughts behind news deception, rather than the surface patterns within words alone. To achieve this goal, we propose Graph-based Intent-Semantic Joint Modeling (InSide) for fake news detection, which models deception clues from both semantic and intent signals via graph-based joint learning. Specifically, InSide reformulates news semantic and intent signals into heterogeneous graph structures, enabling long-range context interaction through entity guidance and capturing both holistic and implementation-level intent via coarse-to-fine intent modeling. To achieve better alignment between semantics and intents, we further develop a dynamic pathway-based graph alignment strategy for effective message passing and aggregation across these signals by establishing a common space. Extensive experiments on four benchmark datasets demonstrate the superiority of the proposed InSide compared to state-of-the-art methods.
中文摘要:本文提出InSide模型,通过图结构联合建模语义与意图信号进行假新闻检测,利用异质图构建和动态对齐策略突破仅依赖语义特征的局限,实现了对新闻欺骗本质的深层理解。
English Summary: This paper introduces InSide, a graph-based model that jointly models semantic and intent signals for fake news detection, addressing limitations of semantic-only approaches by capturing deeper deceptive intents through heterogeneous graph structures and dynamic alignment strategies.

Authors:Jianman Lin, Tianshui Chen, Chunmei Qing, Zhijing Yang, Shuangping Huang, Yuheng Ren, Liang Lin
Title: Neural Scene Designer: Self-Styled Semantic Image Manipulation
Abstract:
Maintaining stylistic consistency is crucial for the cohesion and aesthetic appeal of images, a fundamental requirement in effective image editing and inpainting. However, existing methods primarily focus on the semantic control of generated content, often neglecting the critical task of preserving this consistency. In this work, we introduce the Neural Scene Designer (NSD), a novel framework that enables photo-realistic manipulation of user-specified scene regions while ensuring both semantic alignment with user intent and stylistic consistency with the surrounding environment. NSD leverages an advanced diffusion model, incorporating two parallel cross-attention mechanisms that separately process text and style information to achieve the dual objectives of semantic control and style consistency. To capture fine-grained style representations, we propose the Progressive Self-style Representational Learning (PSRL) module. This module is predicated on the intuitive premise that different regions within a single image share a consistent style, whereas regions from different images exhibit distinct styles. The PSRL module employs a style contrastive loss that encourages high similarity between representations from the same image while enforcing dissimilarity between those from different images. Furthermore, to address the lack of standardized evaluation protocols for this task, we establish a comprehensive benchmark. This benchmark includes competing algorithms, dedicated style-related metrics, and diverse datasets and settings to facilitate fair comparisons. Extensive experiments conducted on our benchmark demonstrate the effectiveness of the proposed framework.
中文: 神经场景设计器(NSD)框架通过双交叉注意力机制和渐进式自风格表征学习模块,实现了语义控制与风格一致的逼真图像编辑,并通过全面基准测试验证了其有效性。
English: The Neural Scene Designer (NSD) framework enables realistic image editing with semantic control and style consistency by using dual cross-attention mechanisms and a Progressive Self-style Representational Learning module, validated through a comprehensive benchmark.

Authors:Che Liu, Zheng Jiang, Chengyu Fang, Heng Guo, Yan-Jie Zhou, Jiaqi Qu, Le Lu, Minfeng Xu
Title: M3Ret: Unleashing Zero-shot Multimodal Medical Image Retrieval via Self-Supervision
Abstract:
Medical image retrieval is essential for clinical decision-making and translational research, relying on discriminative visual representations. Yet, current methods remain fragmented, relying on separate architectures and training strategies for 2D, 3D, and video-based medical data. This modality-specific design hampers scalability and inhibits the development of unified representations. To enable unified learning, we curate a large-scale hybrid-modality dataset comprising 867,653 medical imaging samples, including 2D X-rays and ultrasounds, RGB endoscopy videos, and 3D CT scans. Leveraging this dataset, we train M3Ret, a unified visual encoder without any modality-specific customization. It successfully learns transferable representations using both generative (MAE) and contrastive (SimDINO) self-supervised learning (SSL) paradigms. Our approach sets a new state-of-the-art in zero-shot image-to-image retrieval across all individual modalities, surpassing strong baselines such as DINOv3 and the text-supervised BMC-CLIP. More remarkably, strong cross-modal alignment emerges without paired data, and the model generalizes to unseen MRI tasks, despite never observing MRI during pretraining, demonstrating the generalizability of purely visual self-supervision to unseen modalities. Comprehensive analyses further validate the scalability of our framework across model and data sizes. These findings deliver a promising signal to the medical imaging community, positioning M3Ret as a step toward foundation models for visual SSL in multimodal medical image understanding.
中文: M3Ret通过大规模混合模态医疗数据集训练统一视觉编码器,在二维、三维及视频模态的零样本检索中达到最优性能,并在未配对数据训练的情况下展现出对未见MRI任务的卓越跨模态泛化能力。
English: M3Ret introduces a unified visual encoder trained on a large-scale hybrid-modality medical dataset, achieving state-of-the-art zero-shot retrieval across 2D, 3D, and video modalities while demonstrating exceptional cross-modal generalization to unseen MRI data without paired training.

Authors:Wei Wang, Felix Henry, Junzhe Chen, Dan Zhang, Shiyu Huang, Evgeny Kharlamov, Jie Tang
Title: Can Large Language Models Master Complex Card Games?
Abstract:
Complex games have long been an important benchmark for testing the progress of artificial intelligence algorithms. AlphaGo, AlphaZero, and MuZero have defeated top human players in Go and Chess, garnering widespread societal attention towards artificial intelligence. Concurrently, large language models (LLMs) have exhibited remarkable capabilities across various tasks, raising the question of whether LLMs can achieve similar success in complex games. In this paper, we explore the potential of LLMs in mastering complex card games. We systematically assess the learning capabilities of LLMs across eight diverse card games, evaluating the impact of fine-tuning on high-quality gameplay data, and examining the models' ability to retain general capabilities while mastering these games. Our findings indicate that: (1) LLMs can approach the performance of strong game AIs through supervised fine-tuning on high-quality data, (2) LLMs can master multiple complex card games simultaneously, with performance augmentation for games with similar rules and conflicts for dissimilar ones, and (3) LLMs experience a decline in general capabilities when mastering complex games, but this decline can be mitigated by integrating a certain amount of general instruction data. The evaluation results demonstrate strong learning ability and versatility of LLMs.
中文: 研究表明,大型语言模型通过监督微调能够掌握多种复杂纸牌游戏,其表现可媲美专业游戏AI,尽管通用能力会暂时下降,但通过融入通用指令数据可有效缓解这一问题。
English: This study demonstrates that large language models can master multiple complex card games through supervised fine-tuning, achieving performance comparable to specialized game AIs while experiencing manageable declines in general capabilities that can be mitigated with additional instruction data.

Authors:Jiahao Li, Yang Lu, Yachao Zhang, Fangyong Wang, Yuan Xie, Yanyun Qu
Title: Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation
Abstract:
Open-vocabulary semantic segmentation (OVSS) conducts pixel-level classification via text-driven alignment, where the domain discrepancy between base category training and open-vocabulary inference poses challenges in discriminative modeling of latent unseen category. To address this challenge, existing vision-language model (VLM)-based approaches demonstrate commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, making the bottleneck for OVSS. In this work, we initiate a probing experiment to explore distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework employing latent semantic-aware ``agent'' to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamic and amplifying its perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing the latent semantic saliency.
中文:X-Agent是一种创新的开放词汇语义分割框架,通过引入潜在语义感知代理来协调跨模态注意力机制,在优化潜在语义动态性和增强其可感知性的同时实现了最先进的性能。
English: X-Agent is a novel open-vocabulary semantic segmentation framework that introduces latent semantic-aware agents to optimize cross-modal attention, achieving state-of-the-art performance by enhancing latent semantic dynamics and perceptibility.

Authors:Pei Liu, Qingtian Ning, Xinyan Lu, Haipeng Liu, Weiliang Ma, Dangen She, Peng Jia, Xianpeng Lang, Jun Ma
Title: OmniReason: A Temporal-Guided Vision-Language-Action Framework for Autonomous Driving
Abstract:
Recent advances in vision-language models (VLMs) have demonstrated impressive spatial reasoning capabilities for autonomous driving, yet existing methods predominantly focus on static scene understanding while neglecting the essential temporal dimension of real-world driving scenarios. To address this critical limitation, we propose the OmniReason framework, which establishes robust spatiotemporal reasoning by jointly modeling dynamic 3D environments and their underlying decision-making processes. Our work makes two fundamental advances: (1) We introduce OmniReason-Data, two large-scale vision-language-action (VLA) datasets with dense spatiotemporal annotations and natural language explanations, generated through a novel hallucination-mitigated auto-labeling pipeline that ensures both physical plausibility and temporal coherence; (2) We develop the OmniReason-Agent architecture, which integrates a sparse temporal memory module for persistent scene context modeling and an explanation generator that produces human-interpretable decision rationales, facilitated by our spatiotemporal knowledge distillation approach that effectively captures spatiotemporal causal reasoning patterns. Comprehensive experiments demonstrate state-of-the-art performance, where OmniReason-Agent achieves significant improvements in both open-loop planning tasks and visual question answering (VQA) benchmarks, while establishing new capabilities for interpretable, temporally-aware autonomous vehicles operating in complex, dynamic environments.
中文摘要:OmniReason框架通过建立时空推理模型,结合大规模标注数据集和智能体架构,解决了自动驾驶中缺乏时间维度理解的问题,显著提升了规划能力和决策可解释性。
English Summary: The OmniReason framework addresses the lack of temporal reasoning in autonomous driving by introducing spatiotemporal modeling with large-scale annotated datasets and an agent architecture that enhances planning and interpretability.

Authors:Chen Su, Yuanhe Tian, Yan Song, Yongdong Zhang
Title: Text Reinforcement for Multimodal Time Series Forecasting
Abstract:
Recent studies in time series forecasting (TSF) use multimodal inputs, such as text and historical time series data, to predict future values. These studies mainly focus on developing advanced techniques to integrate textual information with time series data to perform the task and achieve promising results. Meanwhile, these approaches rely on high-quality text and time series inputs, whereas in some cases, the text does not accurately or fully capture the information carried by the historical time series, which leads to unstable performance in multimodal TSF. Therefore, it is necessary to enhance the textual content to improve the performance of multimodal TSF. In this paper, we propose improving multimodal TSF by reinforcing the text modalities. We propose a text reinforcement model (TeR) to generate reinforced text that addresses potential weaknesses in the original text, then apply this reinforced text to support the multimodal TSF model's understanding of the time series, improving TSF performance. To guide the TeR toward producing higher-quality reinforced text, we design a reinforcement learning approach that assigns rewards based on the impact of each reinforced text on the performance of the multimodal TSF model and its relevance to the TSF task. We optimize the TeR accordingly, so as to improve the quality of the generated reinforced text and enhance TSF performance. Extensive experiments on a real-world benchmark dataset covering various domains demonstrate the effectiveness of our approach, which outperforms strong baselines and existing studies on the dataset.
中文: 本文提出一种文本增强模型(TeR),通过强化学习优化生成的文本内容,以提升多模态时间序列预测的准确性和稳定性,实验证明该方法优于现有基准模型。
English: This paper introduces a text reinforcement model (TeR) that enhances multimodal time series forecasting by generating improved text inputs, which are optimized through reinforcement learning to boost prediction accuracy and stability.

Authors:Xiyao Wang, Chunyuan Li, Jianwei Yang, Kai Zhang, Bo Liu, Tianyi Xiong, Furong Huang
Title: LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
Abstract:
In vision-language modeling, critic models are typically trained to evaluate outputs -- assigning scalar scores or pairwise preferences -- rather than to generate responses. This separation from policy models, which produce the responses, is so entrenched that critics are rarely considered for direct policy use. In this work, we challenge this convention. We propose to reorganize preference-labeled critic datasets into verifiable training signals and perform reinforcement learning directly on a base generative model, producing LLaVA-Critic-R1, a multimodal critic trained to optimize preference judgments while retaining full generation ability. Surprisingly, LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model -- matching or surpassing specialized reasoning VLMs trained with in-domain data across 26 visual reasoning and understanding benchmarks, with an average gain of +5.7% over its base model (Qwen-2.5-VL-7B). Extending this approach to existing strong reasoning VLMs yields LLaVA-Critic-R1+, which further advances policy performance without sacrificing critic quality, achieving a SoTA performance of 71.9 on MMMU at the 7B scale. Finally, we show that the enhanced critic ability benefits inference: applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks without additional training. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation, offering a simple path toward scalable, self-improving multimodal systems.
中文: 本研究打破了视觉语言建模中评估模型与生成模型分离的传统范式,通过将偏好标注的评估数据集重组为可验证训练信号,开发出LLaVA-Critic-R1这一统一模型,在保持完整生成能力的同时,在26个视觉推理基准上实现性能突破,证明了评估数据强化训练可创造兼具评判与生成能力的多模态系统。
English: This research challenges the conventional separation between critic and policy models in vision-language modeling by transforming preference-labeled critic datasets into verifiable training signals, resulting in LLaVA-Critic-R1—a unified model that excels in both evaluation and generation while achieving state-of-the-art performance across multiple benchmarks.

Authors:Yuwen Pu, Zhou Feng, Chunyi Zhou, Jiahao Chen, Chunqiang Hu, Haibo Hu, Shouling Ji
Title: FreeTalk:A plug-and-play and black-box defense against speech synthesis attacks
Abstract:
Recently, speech assistant and speech verification have been used in many fields, which brings much benefit and convenience for us. However, when we enjoy these speech applications, our speech may be collected by attackers for speech synthesis. For example, an attacker generates some inappropriate political opinions with the characteristic of the victim's voice by obtaining a piece of the victim's speech, which will greatly influence the victim's reputation. Specifically, with the appearance of some zero-shot voice conversion methods, the cost of speech synthesis attacks has been further reduced, which also brings greater challenges to user voice security and privacy. Some researchers have proposed the corresponding privacy-preserving methods. However, the existing approaches have some non-negligible drawbacks: low transferability and robustness, high computational overhead. These deficiencies seriously limit the existing method deployed in practical scenarios. Therefore, in this paper, we propose a lightweight, robust, plug-and-play privacy preservation method against speech synthesis attacks in a black-box setting. Our method generates and adds a frequency-domain perturbation to the original speech to achieve privacy protection and high speech quality. Then, we present a data augmentation strategy and noise smoothing mechanism to improve the robustness of the proposed method. Besides, to reduce the user's defense overhead, we also propose a novel identity-wise protection mechanism. It can generate a universal perturbation for one speaker and support privacy preservation for speech of any length. Finally, we conduct extensive experiments on 5 speech synthesis models, 5 speech verification models, 1 speech recognition model, and 2 datasets. The experimental results demonstrate that our method has satisfying privacy-preserving performance, high speech quality, and utility.
中文摘要:本文提出了一种轻量级即插即用的语音隐私保护方法,通过添加频域扰动有效防止语音被非法合成,同时利用数据增强和噪声平滑机制保障语音质量与实用性。
English Summary: This paper introduces a lightweight, plug-and-play privacy protection method that adds frequency-domain perturbations to speech, effectively preventing unauthorized synthesis while maintaining speech quality and utility through robust data augmentation and noise smoothing mechanisms.

Authors:Yen-Che Chien, Kuang-Da Wang, Wei-Yao Wang, Wen-Chih Peng
Title: NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks
Abstract:
Recent advances in autonomous digital agents from industry (e.g., Manus AI and Gemini's research mode) highlight potential for structured tasks by autonomous decision-making and task decomposition; however, it remains unclear to what extent the agent-based systems can improve multimodal web data productivity. We study this in the realm of journalism, which requires iterative planning, interpretation, and contextual reasoning from multimodal raw contents to form a well structured news. We introduce NEWSAGENT, a benchmark for evaluating how agents can automatically search available raw contents, select desired information, and edit and rephrase to form a news article by accessing core journalistic functions. Given a writing instruction and firsthand data as how a journalist initiates a news draft, agents are tasked to identify narrative perspectives, issue keyword-based queries, retrieve historical background, and generate complete articles. Unlike typical summarization or retrieval tasks, essential context is not directly available and must be actively discovered, reflecting the information gaps faced in real-world news writing. NEWSAGENT includes 6k human-verified examples derived from real news, with multimodal contents converted to text for broad model compatibility. We evaluate open- and closed-sourced LLMs with commonly-used agentic frameworks on NEWSAGENT, which shows that agents are capable of retrieving relevant facts but struggling with planning and narrative integration. We believe that NEWSAGENT serves a realistic testbed for iterating and evaluating agent capabilities in terms of multimodal web data manipulation to real-world productivity.
Chinese Summary: 近期行业在自主数字代理方面的进展显示出处理结构化任务的潜力,但其在提升多模态网络数据生产力方面的效果尚不明确,因此引入NEWSAGENT基准来评估代理在新闻功能(如内容检索和文章生成)中的能力。
English Summary: Recent industry developments in autonomous digital agents show promise for structured tasks, but their effectiveness in enhancing multimodal web data productivity remains uncertain, leading to the creation of the NEWSAGENT benchmark to evaluate agents' abilities in journalistic functions like content retrieval and article generation.

Authors:Ming Ying, Xiaoming Chen, Qiao Qi, Zhaoyang Zhang
Title: QoS-Driven Satellite Constellation Design for LEO Satellite Internet of Things
Abstract:
Low Earth orbit (LEO) satellite Internet of Things (IoT) has been identified as one of the important components of the sixth-generation (6G) non-terrestrial networks (NTN) to provide ubiquitous connectivity. Due to the low orbit altitude and high mobility, a massive number of satellites are required to form a global continuous coverage constellation, leading to a high construction cost. To this end, this paper proposes a LEO satellite IoT constellation design algorithm with the goal of minimizing the total cost while satisfying quality of service (QoS) requirements in terms of coverage ratio and communication quality. Specifically, with a novel fitness function and efficient algorithm's operators, the proposed algorithm converges more quickly and achieves lower constellation construction cost compared to baseline algorithms under the same QoS requirements. Theoretical analysis proves the global and fast convergence of the proposed algorithm due to a novel fitness function. Finally, extensive simulation results confirm the effectiveness of the proposed algorithm in LEO satellite IoT constellation design.
中文: 本文提出了一种低地球轨道卫星物联网星座设计算法,通过创新的适应度函数在满足服务质量要求的同时,以更快收敛速度和更低成本实现星座构建。
English: This paper introduces a cost-minimizing LEO satellite IoT constellation design algorithm that ensures QoS requirements through a novel fitness function, achieving faster convergence and lower costs than baseline methods.

Authors:Akshay K. Jagadish, Mirko Thalmann, Julian Coda-Forno, Marcel Binz, Eric Schulz
Title: Meta-learning ecological priors from large language models explains human learning and decision making
Abstract:
Human cognition is profoundly shaped by the environments in which it unfolds. Yet, it remains an open question whether learning and decision making can be explained as a principled adaptation to the statistical structure of real-world tasks. We introduce ecologically rational analysis, a computational framework that unifies the normative foundations of rational analysis with ecological grounding. Leveraging large language models to generate ecologically valid cognitive tasks at scale, and using meta-learning to derive rational models optimized for these environments, we develop a new class of learning algorithms: Ecologically Rational Meta-learned Inference (ERMI). ERMI internalizes the statistical regularities of naturalistic problem spaces and adapts flexibly to novel situations, without requiring hand-crafted heuristics or explicit parameter updates. We show that ERMI captures human behavior across 15 experiments spanning function learning, category learning, and decision making, outperforming several established cognitive models in trial-by-trial prediction. Our results suggest that much of human cognition may reflect adaptive alignment to the ecological structure of the problems we encounter in everyday life.
中文: 本研究提出的生态理性元学习推理框架(ERMI)通过大语言模型构建生态效度任务,利用元学习推导适应环境的最优模型,能够内化现实问题的统计规律并灵活适应新情境,在多类认知任务中优于传统模型。
English: The study introduces Ecologically Rational Meta-learned Inference (ERMI), a framework that uses large language models and meta-learning to develop adaptive algorithms which capture human cognition across various tasks by aligning with ecological problem structures.

Authors:Litao Yan, Andrew Head, Ken Milne, Vu Le, Sumit Gulwani, Chris Parnin, Emerson Murphy-Hill
Title: The Invisible Mentor: Inferring User Actions from Screen Recordings to Recommend Better Workflows
Abstract:
Many users struggle to notice when a more efficient workflow exists in feature-rich tools like Excel. Existing AI assistants offer help only after users describe their goals or problems, which can be effortful and imprecise. We present InvisibleMentor, a system that turns screen recordings of task completion into vision-grounded reflections on tasks. It detects issues such as repetitive edits and recommends more efficient alternatives based on observed behavior. Unlike prior systems that rely on logs, APIs, or user prompts, InvisibleMentor operates directly on screen recordings. It uses a two-stage pipeline: a vision-language model reconstructs actions and context, and a language model generates structured, high-fidelity suggestions. In evaluation, InvisibleMentor accurately identified inefficient workflows, and participants found its suggestions more actionable, tailored, and more helpful for learning and improvement compared to a prompt-based spreadsheet assistant.
中文: InvisibleMentor系统通过分析屏幕录像自动检测低效工作流程(如重复编辑),无需用户输入即可提供针对性效率建议,相比基于提示的助手更具可操作性和帮助性。
English: InvisibleMentor is a system that analyzes screen recordings to automatically detect inefficient workflows, such as repetitive edits, and provides tailored efficiency recommendations without requiring user input, proving more actionable and helpful than prompt-based assistants.

Authors:Ruixiao Dong, Zhendong Wang, Keli Liu, Li Li, Ying Chen, Kai Li, Daowen Li, Houqiang Li
Title: EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model
Abstract:
Subject-driven generation is a critical task in creative AI; yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.
中文: EchoGen通过双路径注入策略,在视觉自回归模型上实现主体驱动生成,以更快的采样速度达到了与扩散方法相媲美的高保真度和图像质量。
English: EchoGen introduces a dual-path injection strategy for subject-driven generation using Visual Auto-Regressive models, achieving high fidelity and quality comparable to diffusion methods with faster sampling speeds.

Authors:Zhe Li, Zhiwei Lin, Yongtao Wang
Title: CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search
Abstract:
The integration of Large Language Models (LLMs) with Neural Architecture Search (NAS) has introduced new possibilities for automating the design of neural architectures. However, most existing methods face critical limitations, including architectural invalidity, computational inefficiency, and inferior performance compared to traditional NAS. In this work, we present Collaborative LLM-based NAS (CoLLM-NAS), a two-stage NAS framework with knowledge-guided search driven by two complementary LLMs. Specifically, we propose a Navigator LLM to guide search direction and a Generator LLM to synthesize high-quality candidates, with a dedicated Coordinator module to manage their interaction. CoLLM-NAS efficiently guides the search process by combining LLMs' inherent knowledge of structured neural architectures with progressive knowledge from iterative feedback and historical trajectory. Experimental results on ImageNet and NAS-Bench-201 show that CoLLM-NAS surpasses existing NAS methods and conventional search algorithms, achieving new state-of-the-art results. Furthermore, CoLLM-NAS consistently enhances the performance and efficiency of various two-stage NAS methods (e.g., OFA, SPOS, and AutoFormer) across diverse search spaces (e.g., MobileNet, ShuffleNet, and AutoFormer), demonstrating its excellent generalization.
中文摘要:CoLLM-NAS提出了一种协作式双大语言模型框架,通过导航器引导搜索方向与生成器合成优质架构的协同机制,有效解决了现有神经架构搜索方法的缺陷,在多个基准测试和搜索空间中实现了最优性能。
English Summary: CoLLM-NAS introduces a collaborative dual-LLM framework that overcomes limitations in neural architecture search by combining navigator-guided search direction with generator-synthesized candidates, achieving state-of-the-art performance across multiple benchmarks and search spaces.

Authors:Si-Cheng Wang, Tian-Yu Xiang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Ao-Qun Jin, Zeng-Guang Hou
Title: VLA Model Post-Training via Action-Chunked PPO and Self Behavior Cloning
Abstract:
Reinforcement learning (RL) is a promising avenue for post-training vision-language-action (VLA) models, but practical deployment is hindered by sparse rewards and unstable training. This work mitigates these challenges by introducing an action chunk based on proximal policy optimization (PPO) with behavior cloning using self-collected demonstrations. Aggregating consecutive actions into chunks improves the temporal consistency of the policy and the density of informative feedback. In addition, an auxiliary behavior cloning loss is applied with a dynamically updated demonstration buffer that continually collects high-quality task trials during training. The relative weight between the action-chunked PPO objective and the self behavior clone auxiliary loss is adapted online to stabilize the post-training process. Experiments on the MetaWorld benchmark indicate improved performance over supervised fine-tuning, achieving a high success rate (0.93) and few steps to success (42.17). These results demonstrate the viability of RL for VLA post-training and help lay the groundwork for downstream VLA applications.
中文: 本研究通过采用动作分块的PPO算法和动态行为克隆,有效提升了视觉-语言-动作模型的强化学习效果,在MetaWorld基准测试中表现卓越。
English: This study enhances reinforcement learning for vision-language-action models by implementing action-chunked PPO with dynamic behavior cloning, achieving superior performance on the MetaWorld benchmark.

Authors:Houyi Qi, Minghui Liwang, Liqun Fu, Xianbin Wang, Huaiyu Dai, Xiaoyu Xia
Title: Oh-Trust: Overbooking and Hybrid Trading-empowered Resource Scheduling with Smart Reputation Update over Dynamic Edge Networks
Abstract:
Incentive-driven computing resource sharing is crucial for meeting the ever-growing demands of emerging mobile applications. Although conventional spot trading offers a solution, it frequently leads to excessive overhead due to the need for real-time trading related interactions. Likewise, traditional futures trading, which depends on historical data, is susceptible to risks from network dynamics. This paper explores a dynamic and uncertain edge network comprising a computing platform, e.g., an edge server, that offers computing services as resource seller, and various types of mobile users with diverse resource demands as buyers, including fixed buyers (FBs) and uncertain occasional buyers (OBs) with fluctuating needs. To facilitate efficient and timely computing services, we propose an overbooking- and hybrid trading-empowered resource scheduling mechanism with reputation update, termed Oh-Trust. Particularly, our Oh-Trust incentivizes FBs to enter futures trading by signing long-term contracts with the seller, while simultaneously attracting OBs to spot trading, enhancing resource utilization and profitability for both parties. Crucially, to adapt to market fluctuations, a smart reputation updating mechanism is integrated, allowing for the timely renewal of long-term contracts to optimize trading performance. Extensive simulations using real-world datasets demonstrate the effectiveness of Oh-Trust across multiple evaluation metrics.
中文: 本文提出Oh-Trust机制,通过融合期货与现货交易及信誉更新,在动态边缘网络中为固定和临时用户提供高效资源调度,显著提升资源利用率和交易收益。
English: This paper introduces Oh-Trust, a hybrid resource trading mechanism that combines futures and spot trading with reputation updates to efficiently serve both fixed and occasional buyers in dynamic edge networks, enhancing resource utilization and profitability.

Authors:Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Junyan Ye, Ke-Yue Zhang, Yue Zhou, Peng Jin, Bin Li, Taiping Yao, Shouhong Ding
Title: Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection
Abstract:
Detecting AI-generated images with multimodal large language models (MLLMs) has gained increasing attention, due to their rich world knowledge, common-sense reasoning, and potential for explainability. However, naively applying those MLLMs for detection often leads to suboptimal performance. We argue that the root of this failure lies in a fundamental mismatch: MLLMs are asked to reason about fakes before they can truly see them. First, they do not really see: existing MLLMs' vision encoders are primarily optimized for semantic-oriented recognition rather than the perception of low-level signals, leaving them insensitive to subtle forgery traces. Without access to reliable perceptual evidence, the model grounds its judgment on incomplete and limited visual observations. Second, existing finetuning data for detection typically uses narrow, instruction-style formats, which diverge sharply from the diverse, heterogeneous distributions seen in pretraining. In the absence of meaningful visual cues, the model therefore exploits these linguistic shortcuts, resulting in catastrophic forgetting of pretrained knowledge (even the basic dialogue capabilities). In response, we advocate for a new paradigm: seeing before reasoning. We propose that MLLMs should first be trained to perceive artifacts-strengthening their artifact-aware visual perception-so that subsequent reasoning is grounded in actual observations. We therefore propose Forensic-Chat, a generalizable, explainable, and still-conversational (for multi-round dialogue) assistant for fake image detection. We also propose ExplainFake-Bench, a benchmark tailored for the evaluation of the MLLM's explainability for image forensics from five key aspects. Extensive experiments show its superiority of generalization and genuinely reliable explainability.
Chinese Summary: 该研究提出“先观察后推理”的新范式,通过增强多模态大语言模型的伪影感知能力开发了可泛化、可解释的 Forensic-Chat 检测系统,并在专门构建的 ExplainFake-Bench 基准测试中展现出优越性能。
English Summary: The study introduces a "seeing before reasoning" paradigm with Forensic-Chat, an MLLM that enhances artifact-aware visual perception to improve fake image detection and explanation, validated by the ExplainFake-Bench benchmark.

Authors:Zenghui Huang, Ting Shu, Zhonglei Wang, Yang Lu, Yan Yan, Wei Zhong, Hanzi Wang
Title: DPSformer: A long-tail-aware model for improving heavy rainfall prediction
Abstract:
Accurate and timely forecasting of heavy rainfall remains a critical challenge for modern society. Precipitation exhibits a highly imbalanced distribution: most observations record no or light rain, while heavy rainfall events are rare. Such an imbalanced distribution obstructs deep learning models from effectively predicting heavy rainfall events. To address this challenge, we treat rainfall forecasting explicitly as a long-tailed learning problem, identifying the insufficient representation of heavy rainfall events as the primary barrier to forecasting accuracy. Therefore, we introduce DPSformer, a long-tail-aware model that enriches representation of heavy rainfall events through a high-resolution branch. For heavy rainfall events $ \geq $ 50 mm/6 h, DPSformer lifts the Critical Success Index (CSI) of a baseline Numerical Weather Prediction (NWP) model from 0.012 to 0.067. For the top 1% coverage of heavy rainfall events, its Fraction Skill Score (FSS) exceeds 0.45, surpassing existing methods. Our work establishes an effective long-tailed paradigm for heavy rainfall prediction, offering a practical tool to enhance early warning systems and mitigate the societal impacts of extreme weather events.
中文: 将强降雨预报视为长尾学习问题,DPSformer模型通过增强强降雨表征能力,显著提升了预测指标,超越了现有方法的性能。
English: Heavy rainfall forecasting is improved by treating it as a long-tailed learning problem, with the DPSformer model enhancing heavy rainfall representation and significantly boosting prediction metrics over baseline methods.

Authors:Yen-Ju Lu, Thomas Thebaud, Laureano Moro-Velazquez, Najim Dehak, Jesus Villalba
Title: Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation
Abstract:
We present Paired by the Teacher (PbT), a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data. In many low-resource natural language generation (NLG) scenarios, practitioners may have only raw outputs, like highlights, recaps, or questions, or only raw inputs, such as articles, dialogues, or paragraphs, but seldom both. This mismatch forces small models to learn from very few examples or rely on costly, broad-scope synthetic examples produced by large LLMs. PbT addresses this by asking a teacher LLM to compress each unpaired example into a concise intermediate representation (IR), and training a student to reconstruct inputs from IRs. This enables outputs to be paired with student-generated inputs, yielding high-quality synthetic data. We evaluate PbT on five benchmarks-document summarization (XSum, CNNDM), dialogue summarization (SAMSum, DialogSum), and question generation (SQuAD)-as well as an unpaired setting on SwitchBoard (paired with DialogSum summaries). An 8B student trained only on PbT data outperforms models trained on 70 B teacher-generated corpora and other unsupervised baselines, coming within 1.2 ROUGE-L of human-annotated pairs and closing 82% of the oracle gap at one-third the annotation cost of direct synthesis. Human evaluation on SwitchBoard further confirms that only PbT produces concise, faithful summaries aligned with the target style, highlighting its advantage of generating in-domain sources that avoid the mismatch, limiting direct synthesis.
中文: PbT是一种师生框架,通过教师大语言模型将未配对样本压缩为中间表示,使学生模型能重构输入并生成无需人工标注的高质量合成数据。
English: PbT is a teacher-student framework that generates high-quality synthetic data by having a teacher LLM compress unpaired examples into intermediate representations, enabling a student model to reconstruct inputs and create accurate input-output pairs without human labels.

Authors:Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini
Title: Hyperdimensional Probe: Decoding LLM Representations via Vector Symbolic Architectures
Abstract:
Despite their capabilities, Large Language Models (LLMs) remain opaque with limited understanding of their internal representations. Current interpretability methods, such as direct logit attribution (DLA) and sparse autoencoders (SAEs), provide restricted insight due to limitations such as the model's output vocabulary or unclear feature names. This work introduces Hyperdimensional Probe, a novel paradigm for decoding information from the LLM vector space. It combines ideas from symbolic representations and neural probing to project the model's residual stream into interpretable concepts via Vector Symbolic Architectures (VSAs). This probe combines the strengths of SAEs and conventional probes while overcoming their key limitations. We validate our decoding paradigm with controlled input-completion tasks, probing the model's final state before next-token prediction on inputs spanning syntactic pattern recognition, key-value associations, and abstract inference. We further assess it in a question-answering setting, examining the state of the model both before and after text generation. Our experiments show that our probe reliably extracts meaningful concepts across varied LLMs, embedding sizes, and input domains, also helping identify LLM failures. Our work advances information decoding in LLM vector space, enabling extracting more informative, interpretable, and structured features from neural representations.
中文: 本文提出超维探针方法,利用向量符号架构从大语言模型向量空间中解码可解释概念,克服了现有技术的局限,并在多种任务和模型上验证了其有效性。
English: This paper introduces Hyperdimensional Probe, a novel method that leverages Vector Symbolic Architectures to decode interpretable concepts from LLMs' vector space, overcoming limitations of existing techniques and demonstrating effectiveness across diverse tasks and models.

Authors:Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan
Title: GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
Abstract:
Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task aware reasoning. We propose a novel post training framework that incorporates task aware rewards to enable effective adaptation of reasoning based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state of the art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/ .
中文摘要:本研究提出了一种新颖的后训练框架,通过引入任务感知奖励来增强强化学习模型在地球观测任务中的表现,在多个基准测试中显著提升了推理能力、优化稳定性和鲁棒性。
English Summary: This study introduces a novel post-training framework that enhances reinforcement learning models for Earth Observation tasks by incorporating task-aware rewards, improving reasoning capabilities, stability, and robustness across multiple benchmarks.

Authors:Zach Eidex, Mojtaba Safari, Jie Ding, Richard Qiu, Justin Roper, David Yu, Hui-Kuo Shu, Zhen Tian, Hui Mao, Xiaofeng Yang
Title: An Efficient 3D Latent Diffusion Model for T1-contrast Enhanced MRI Generation
Abstract:
Objective: Gadolinium-based contrast agents (GBCAs) are commonly employed with T1w MRI to enhance lesion visualization but are restricted in patients at risk of nephrogenic systemic fibrosis and variations in GBCA administration can introduce imaging inconsistencies. This study develops an efficient 3D deep-learning framework to generate T1-contrast enhanced images (T1C) from pre-contrast multiparametric MRI. Approach: We propose the 3D latent rectified flow (T1C-RFlow) model for generating high-quality T1C images. First, T1w and T2-FLAIR images are input into a pretrained autoencoder to acquire an efficient latent space representation. A rectified flow diffusion model is then trained in this latent space representation. The T1C-RFlow model was trained on a curated dataset comprised of the BraTS 2024 glioma (GLI; 1480 patients), meningioma (MEN; 1141 patients), and metastases (MET; 1475 patients) datasets. Selected patients were split into train (N=2860), validation (N=612), and test (N=614) sets. Results: Both qualitative and quantitative results demonstrate that the T1C-RFlow model outperforms benchmark 3D models (pix2pix, DDPM, Diffusion Transformers (DiT-3D)) trained in the same latent space. T1C-RFlow achieved the following metrics - GLI: NMSE 0.044 +/- 0.047, SSIM 0.935 +/- 0.025; MEN: NMSE 0.046 +/- 0.029, SSIM 0.937 +/- 0.021; MET: NMSE 0.098 +/- 0.088, SSIM 0.905 +/- 0.082. T1C-RFlow had the best tumor reconstruction performance and significantly faster denoising times (6.9 s/volume, 200 steps) than conventional DDPM models in both latent space (37.7s, 1000 steps) and patch-based in image space (4.3 hr/volume). Significance: Our proposed method generates synthetic T1C images that closely resemble ground truth T1C in much less time than previous diffusion models. Further development may permit a practical method for contrast-agent-free MRI for brain tumors.
中文: 本研究开发的T1C-RFlow三维深度学习框架能通过非增强MRI生成合成对比增强图像,在显著提升生成效率的同时保持了与真实影像的高度一致性,为脑肿瘤患者实现无造影剂MRI检查提供了潜在解决方案。
English: This study introduces a 3D deep-learning framework called T1C-RFlow that efficiently generates synthetic contrast-enhanced MRI images from pre-contrast scans, demonstrating superior performance and faster processing compared to existing models while potentially enabling gadolinium-free brain tumor imaging.

Authors:Manan Tayal, Aditya Singh, Shishir Kolathaya, Somil Bansal
Title: MAD-PINN: A Decentralized Physics-Informed Machine Learning Framework for Safe and Optimal Multi-Agent Control
Abstract:
Co-optimizing safety and performance in large-scale multi-agent systems remains a fundamental challenge. Existing approaches based on multi-agent reinforcement learning (MARL), safety filtering, or Model Predictive Control (MPC) either lack strict safety guarantees, suffer from conservatism, or fail to scale effectively. We propose MAD-PINN, a decentralized physics-informed machine learning framework for solving the multi-agent state-constrained optimal control problem (MASC-OCP). Our method leverages an epigraph-based reformulation of SC-OCP to simultaneously capture performance and safety, and approximates its solution via a physics-informed neural network. Scalability is achieved by training the SC-OCP value function on reduced-agent systems and deploying them in a decentralized fashion, where each agent relies only on local observations of its neighbours for decision-making. To further enhance safety and efficiency, we introduce an Hamilton-Jacobi (HJ) reachability-based neighbour selection strategy to prioritize safety-critical interactions, and a receding-horizon policy execution scheme that adapts to dynamic interactions while reducing computational burden. Experiments on multi-agent navigation tasks demonstrate that MAD-PINN achieves superior safety-performance trade-offs, maintains scalability as the number of agents grows, and consistently outperforms state-of-the-art baselines.
中文:提出的MAD-PINN框架通过分布式物理信息方法解决多智能体安全与性能的协同优化问题,结合对等重构与神经网络及基于可达性的邻居选择策略,在保证可扩展安全性的同时显著优于现有基准方法。
English: The proposed MAD-PINN framework addresses multi-agent safety and performance challenges through a decentralized physics-informed approach, combining epigraph reformulation with neural networks and reachability-based neighbor selection to achieve scalable safety guarantees while outperforming existing methods.

Authors:Tatsuro Banno, Takehiko Ohkawa, Ruicong Liu, Ryosuke Furuta, Yoichi Sato
Title: AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities
Abstract:
Bimanual human activities inherently involve coordinated movements of both hands and body. However, the impact of this coordination in activity understanding has not been systematically evaluated due to the lack of suitable datasets. Such evaluation demands kinematic-level annotations (e.g., 3D pose) for the hands and body, yet existing 3D activity datasets typically annotate either hand or body pose. Another line of work employs marker-based motion capture to provide full-body pose, but the physical markers introduce visual artifacts, thereby limiting models' generalization to natural, markerless videos. To address these limitations, we present AssemblyHands-X, the first markerless 3D hand-body benchmark for bimanual activities, designed to study the effect of hand-body coordination for action recognition. We begin by constructing a pipeline for 3D pose annotation from synchronized multi-view videos. Our approach combines multi-view triangulation with SMPL-X mesh fitting, yielding reliable 3D registration of hands and upper body. We then validate different input representations (e.g., video, hand pose, body pose, or hand-body pose) across recent action recognition models based on graph convolution or spatio-temporal attention. Our extensive experiments show that pose-based action inference is more efficient and accurate than video baselines. Moreover, joint modeling of hand and body cues improves action recognition over using hands or upper body alone, highlighting the importance of modeling interdependent hand-body dynamics for a holistic understanding of bimanual activities.
中文:AssemblyHands-X是首个针对双手活动的无标记3D手部-身体基准,实验证明联合建模手部和身体线索比单独使用任一数据更能提升动作识别效果。
English: AssemblyHands-X is the first markerless 3D hand-body benchmark for bimanual activities, demonstrating that joint modeling of hand and body cues improves action recognition over using either alone.

Authors:Dianshu Liao, Xin Yin, Shidong Pan, Chao Ni, Zhenchang Xing, Xiaoyu Sun
Title: Navigating the Labyrinth: Path-Sensitive Unit Test Generation with Large Language Models
Abstract:
Unit testing is essential for software quality assurance, yet writing and maintaining tests remains time-consuming and error-prone. To address this challenge, researchers have proposed various techniques for automating unit test generation, including traditional heuristic-based methods and more recent approaches that leverage large language models (LLMs). However, these existing approaches are inherently path-insensitive because they rely on fixed heuristics or limited contextual information and fail to reason about deep control-flow structures. As a result, they often struggle to achieve adequate coverage, particularly for deep or complex execution paths. In this work, we present a path-sensitive framework, JUnitGenie, to fill this gap by combining code knowledge with the semantic capabilities of LLMs in guiding context-aware unit test generation. After extracting code knowledge from Java projects, JUnitGenie distills this knowledge into structured prompts to guide the generation of high-coverage unit tests. We evaluate JUnitGenie on 2,258 complex focal methods from ten real-world Java projects. The results show that JUnitGenie generates valid tests and improves branch and line coverage by 29.60% and 31.00% on average over both heuristic and LLM-based baselines. We further demonstrate that the generated test cases can uncover real-world bugs, which were later confirmed and fixed by developers.
中文: JUnitGenie作为一种路径敏感的框架,通过将代码知识与大语言模型相结合来改进单元测试生成,显著提高了覆盖率,并能有效发现Java项目中的实际错误。
English: JUnitGenie is a path-sensitive framework that enhances unit test generation by integrating code knowledge with large language models, significantly improving coverage and effectively identifying real-world bugs in Java projects.

Authors:Ruibing Hou, Mingshuang Luo, Hongyu Pan, Hong Chang, Shiguang Shan
Title: MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing
Abstract:
This paper proposes MotionVerse, a unified framework that harnesses the capabilities of Large Language Models (LLMs) to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. To efficiently represent motion data, we employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. Furthermore, we introduce a \textit{Delay Parallel} Modeling strategy, which temporally staggers the encoding of residual token streams. This design enables LLMs to effectively capture inter-stream dependencies while maintaining computational efficiency comparable to single-stream modeling. Moreover, to alleviate modality interference between motion and language, we design a \textit{dual-tower architecture} with modality-specific parameters, ensuring stable integration of motion information for both comprehension and generation tasks. Comprehensive ablation studies demonstrate the effectiveness of each component in MotionVerse, and extensive experiments showcase its superior performance across a wide range of motion-relevant tasks.
中文: MotionVerse是一个统一框架,利用大型语言模型通过残差量化的运动分词器和延迟并行建模策略来理解、生成和编辑人体运动,并采用双塔架构分离运动与语言模态,从而在多种任务中实现卓越性能。
English: MotionVerse is a unified framework using Large Language Models to understand, generate, and edit human motion through a motion tokenizer with residual quantization and a Delay Parallel Modeling strategy, enhanced by a dual-tower architecture to separate motion and language modalities for improved performance across various tasks.

Authors:Takehiko Ohkawa, Jihyun Lee, Shunsuke Saito, Jason Saragih, Fabian Prado, Yichen Xu, Shoou-I Yu, Ryosuke Furuta, Yoichi Sato, Takaaki Shiratori
Title: Generative Modeling of Shape-Dependent Self-Contact Human Poses
Abstract:
One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its relevance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact poses and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving single-view pose estimation in self-contact.
Chinese Summary: 该研究推出了首个具有精确体型标注的大规模自接触数据集Goliath-SC,并开发了基于体型参数条件化的生成模型,有效提升了自接触姿态分布建模和单视角姿态估计的准确性。
English Summary: The study introduces Goliath-SC, the first extensive self-contact dataset with precise body shapes, and develops a generative model that uses shape conditioning to improve self-contact pose distribution modeling and single-view pose estimation.

Authors:Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
Title: Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
Abstract:
The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
中文:最新的4位浮点格式如MXFP4和NVFP4虽有望加速大语言模型推理,但存在实际性能差距;新提出的MR-GPTQ方法通过定制化优化有效提升精度,实现了性能与准确性的新平衡。
English: Recent 4-bit floating-point formats like MXFP4 and NVFP4 show potential for accelerating LLM inference but face practical limitations, which are addressed by the new MR-GPTQ method that enhances accuracy and performance through tailored optimizations.

Authors:Zhuang Qi, Pan Yu, Lei Meng, Sijin Zhou, Han Yu, Xiaoxiao Li, Xiangxu Meng
Title: Global Prompt Refinement with Non-Interfering Attention Masking for One-Shot Federated Learning
Abstract:
Federated Prompt Learning (FPL) enables communication-efficient adaptation by tuning lightweight prompts on top of frozen pre-trained models. Existing FPL methods typically rely on global information, which is only available after the second training round, to facilitate collaboration among client models. Therefore, they are inherently dependent on multi-round communication to fully exhibit their strengths. Moreover, existing one-shot federated learning methods typically focus on fitting seen tasks, but lack cross-task generalization. To bridge this gap, we propose the Global Prompt Refinement with Non-Interfering Attention Masking (GPR-NIAM) method for one-shot FPL. The core idea is to design a masking mechanism that restricts excessive interaction between the original text embeddings and the learnable prompt embeddings. GPR-NIAM achieves this through the collaboration of two key modules. Firstly, the attention isolation module suppresses attention from the learnable prompt tokens to the original text tokens, and reweights the reverse attention which preserves generalization across tasks. Secondly, the cross-silo collaborative refinement module integrates decentralized visual knowledge into a unified base and calibrates the global prompt through multi-source cross-modal knowledge alignment, further mitigating the inconsistency caused by data heterogeneity. Extensive experiments conducted on ten benchmark datasets under two tasks show that GPR-NIAM outperforms eight state-of-the-art methods in both class-level and domain-level generalization.
中文: GPR-NIAM提出了一种单次联邦提示学习方法,通过注意力掩码限制文本与提示嵌入间的过度交互,在多项基准测试中展现出卓越的跨任务泛化能力,优于现有方法。
English: GPR-NIAM introduces a one-shot federated prompt learning method using attention masking to limit interactions between text and prompt embeddings, enabling superior cross-task generalization and outperforming existing methods on multiple benchmarks.

Authors:Xiaoyun Qiu, Haichao Liu, Yue Pan, Jun Ma, Xinhu Zheng
Title: An Intention-driven Lane Change Framework Considering Heterogeneous Dynamic Cooperation in Mixed-traffic Environment
Abstract:
In mixed-traffic environments, where autonomous vehicles (AVs) interact with diverse human-driven vehicles (HVs), unpredictable intentions and heterogeneous behaviors make safe and efficient lane change maneuvers highly challenging. Existing methods often oversimplify these interactions by assuming uniform patterns. We propose an intention-driven lane change framework that integrates driving-style recognition, cooperation-aware decision-making, and coordinated motion planning. A deep learning classifier trained on the NGSIM dataset identifies human driving styles in real time. A cooperation score with intrinsic and interactive components estimates surrounding drivers' intentions and quantifies their willingness to cooperate with the ego vehicle. Decision-making combines behavior cloning with inverse reinforcement learning to determine whether a lane change should be initiated. For trajectory generation, model predictive control is integrated with IRL-based intention inference to produce collision-free and socially compliant maneuvers. Experiments show that the proposed model achieves 94.2\% accuracy and 94.3\% F1-score, outperforming rule-based and learning-based baselines by 4-15\% in lane change recognition. These results highlight the benefit of modeling inter-driver heterogeneity and demonstrate the potential of the framework to advance context-aware and human-like autonomous driving in complex traffic environments.
中文摘要:该研究提出的意图驱动换道框架通过整合实时驾驶风格识别、合作感知决策和协调运动规划,有效应对混合交通中的异质行为,在换道识别中达到94.2%的准确率,显著优于现有方法。
English Summary: The proposed intention-driven lane change framework integrates real-time driving style recognition, cooperation-aware decision-making, and coordinated motion planning to address heterogeneous behaviors in mixed traffic, achieving superior performance over existing methods with 94.2% accuracy in lane change recognition.

Authors:Quanzhou Li, Zhonghua Wu, Jingbo Wang, Chen Change Loy, Bo Dai
Title: DHAGrasp: Synthesizing Affordance-Aware Dual-Hand Grasps with Text Instructions
Abstract:
Learning to generate dual-hand grasps that respect object semantics is essential for robust hand-object interaction but remains largely underexplored due to dataset scarcity. Existing grasp datasets predominantly focus on single-hand interactions and contain only limited semantic part annotations. To address these challenges, we introduce a pipeline, SymOpt, that constructs a large-scale dual-hand grasp dataset by leveraging existing single-hand datasets and exploiting object and hand symmetries. Building on this, we propose a text-guided dual-hand grasp generator, DHAGrasp, that synthesizes Dual-Hand Affordance-aware Grasps for unseen objects. Our approach incorporates a novel dual-hand affordance representation and follows a two-stage design, which enables effective learning from a small set of segmented training objects while scaling to a much larger pool of unsegmented data. Extensive experiments demonstrate that our method produces diverse and semantically consistent grasps, outperforming strong baselines in both grasp quality and generalization to unseen objects. The project page is at https://quanzhou-li.github.io/DHAGrasp/.
中文摘要:本文提出SymOpt流程构建大规模双手抓取数据集,并开发DHAGrasp文本引导生成器,能为未见物体生成语义一致的双手抓取动作,在抓取质量和泛化能力上均优于现有方法。
English Summary: This paper introduces SymOpt, a pipeline for creating a large-scale dual-hand grasp dataset, and DHAGrasp, a text-guided generator that produces semantically consistent dual-hand grasps for unseen objects, demonstrating superior performance over existing methods.

Authors:Yuki Sakai, Ryosuke Furuta, Juichun Yen, Yoichi Sato
Title: EgoInstruct: An Egocentric Video Dataset of Face-to-face Instructional Interactions with Multi-modal LLM Benchmarking
Abstract:
Analyzing instructional interactions between an instructor and a learner who are co-present in the same physical space is a critical problem for educational support and skill transfer. Yet such face-to-face instructional scenes have not been systematically studied in computer vision. We identify two key reasons: i) the lack of suitable datasets and ii) limited analytical techniques. To address this gap, we present a new egocentric video dataset of face-to-face instruction and provide ground-truth annotations for two fundamental tasks that serve as a first step toward a comprehensive understanding of instructional interactions: procedural step segmentation and conversation-state classification. Using this dataset, we benchmark multimodal large language models (MLLMs) against conventional task-specific models. Since face-to-face instruction involves multiple modalities (speech content and prosody, gaze and body motion, and visual context), effective understanding requires methods that handle verbal and nonverbal communication in an integrated manner. Accordingly, we evaluate recently introduced MLLMs that jointly process images, audio, and text. This evaluation quantifies the extent to which current machine learning models understand face-to-face instructional scenes. In experiments, MLLMs outperform specialized baselines even without task-specific fine-tuning, suggesting their promise for holistic understanding of instructional interactions.
中文摘要:本研究提出了一个新的自我中心视角视频数据集,用于分析面对面教学互动,并证明多模态大语言模型无需特定任务训练即可在理解这些复杂多模态场景方面超越专业基线模型。
English Summary: This study introduces a new egocentric video dataset for analyzing face-to-face instructional interactions and demonstrates that multimodal large language models outperform specialized baselines in understanding these complex multimodal scenes without task-specific training.

Authors:Amine Bechar, Adel Oulefki, Abbes Amira, Fatih Kurogollu, Yassine Himeur
Title: Extracting Actionable Insights from Building Energy Data using Vision LLMs on Wavelet and 3D Recurrence Representations
Abstract:
The analysis of complex building time-series for actionable insights and recommendations remains challenging due to the nonlinear and multi-scale characteristics of energy data. To address this, we propose a framework that fine-tunes visual language large models (VLLMs) on 3D graphical representations of the data. The approach converts 1D time-series into 3D representations using continuous wavelet transforms (CWTs) and recurrence plots (RPs), which capture temporal dynamics and localize frequency anomalies. These 3D encodings enable VLLMs to visually interpret energy-consumption patterns, detect anomalies, and provide recommendations for energy efficiency. We demonstrate the framework on real-world building-energy datasets, where fine-tuned VLLMs successfully monitor building states, identify recurring anomalies, and generate optimization recommendations. Quantitatively, the Idefics-7B VLLM achieves validation losses of 0.0952 with CWTs and 0.1064 with RPs on the University of Sharjah energy dataset, outperforming direct fine-tuning on raw time-series data (0.1176) for anomaly detection. This work bridges time-series analysis and visualization, providing a scalable and interpretable framework for energy analytics.
中文摘要:本研究提出一种框架,通过连续小波变换和递归图将一维时间序列转换为三维表示,对视觉语言大模型进行微调,从而有效检测建筑能耗异常并提供节能建议,实现了时间序列分析与可视化的结合。
English Summary: This study introduces a framework that fine-tunes visual language models on 3D representations of building energy data, enabling effective anomaly detection and energy efficiency recommendations by converting time-series into visual formats through continuous wavelet transforms and recurrence plots.

Authors:Songjun Tu, Qichao Zhang, Jingbo Sun, Yuqian Fu, Linjing Li, Xiangyuan Lan, Dongmei Jiang, Yaowei Wang, Dongbin Zhao
Title: Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization
Abstract:
While multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning, their performance is often undermined by a critical vulnerability: perception-induced errors that propagate through the reasoning chain. Current reinforcement learning (RL) fine-tuning methods, while enhancing reasoning abilities, largely fail to address the underlying misalignment between visual grounding and the subsequent reasoning process. To address this challenge, we propose \textbf{Caption-Regularized Policy Optimization (CapPO)}, a novel RL framework that explicitly enforces perceptual consistency during policy optimization. CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, thereby anchoring reasoning to semantically faithful visual content; and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories while suppressing spurious correlations. Extensive experiments on five math-focused and five general reasoning benchmarks demonstrate that CapPO achieves competitive performance, yielding gains of +6.0% accuracy on math-related tasks and +2.4% on general reasoning tasks over the base Qwen2.5-VL-7B model. Moreover, ablation studies further confirm the effectiveness of each component, while error analysis reveals that CapPO significantly reduces perception-related mistakes compared with baselines. Overall, CapPO provides a simple yet effective framework for improving multimodal reasoning.
中文: CapPO是一种新颖的强化学习框架,通过基于描述的规则化方法和自适应优势估计来增强感知一致性,在数学和通用推理任务上实现了显著的准确率提升。
English: CapPO is a novel reinforcement learning framework that enhances multimodal reasoning by enforcing perceptual consistency through caption-based regularization and adaptive advantage estimation, achieving significant accuracy improvements on math and general reasoning tasks.

Authors:Ciyuan Peng, Nguyen Linh Dan Le, Shan Jin, Dexuan Ding, Shuo Yu, Feng Xia
Title: Brain PathoGraph Learning
Abstract:
Brain graph learning has demonstrated significant achievements in the fields of neuroscience and artificial intelligence. However, existing methods struggle to selectively learn disease-related knowledge, leading to heavy parameters and computational costs. This challenge diminishes their efficiency, as well as limits their practicality for real-world clinical applications. To this end, we propose a lightweight Brain PathoGraph Learning (BrainPoG) model that enables efficient brain graph learning by pathological pattern filtering and pathological feature distillation. Specifically, BrainPoG first contains a filter to extract the pathological pattern formulated by highly disease-relevant subgraphs, achieving graph pruning and lesion localization. A PathoGraph is therefore constructed by dropping less disease-relevant subgraphs from the whole brain graph. Afterwards, a pathological feature distillation module is designed to reduce disease-irrelevant noise features and enhance pathological features of each node in the PathoGraph. BrainPoG can exclusively learn informative disease-related knowledge while avoiding less relevant information, achieving efficient brain graph learning. Extensive experiments on four benchmark datasets demonstrate that BrainPoG exhibits superiority in both model performance and computational efficiency across various brain disease detection tasks.
中文:BrainPoG模型通过病理模式过滤和特征蒸馏,有效提升了脑图学习效率,在多种脑疾病检测任务中展现出卓越的性能和计算优势。
English: The BrainPoG model enhances brain graph learning efficiency by filtering pathological patterns and distilling features, achieving superior performance and computational effectiveness in disease detection tasks.

Authors:Yutong Li, Jieyi Zhang, Wenqiang Xu, Tutian Tang, Cewu Lu
Title: FSGlove: An Inertial-Based Hand Tracking System with Shape-Aware Calibration
Abstract:
Accurate hand motion capture (MoCap) is vital for applications in robotics, virtual reality, and biomechanics, yet existing systems face limitations in capturing high-degree-of-freedom (DoF) joint kinematics and personalized hand shape. Commercial gloves offer up to 21 DoFs, which are insufficient for complex manipulations while neglecting shape variations that are critical for contact-rich tasks. We present FSGlove, an inertial-based system that simultaneously tracks up to 48 DoFs and reconstructs personalized hand shapes via DiffHCal, a novel calibration method. Each finger joint and the dorsum are equipped with IMUs, enabling high-resolution motion sensing. DiffHCal integrates with the parametric MANO model through differentiable optimization, resolving joint kinematics, shape parameters, and sensor misalignment during a single streamlined calibration. The system achieves state-of-the-art accuracy, with joint angle errors of less than 2.7 degree, and outperforms commercial alternatives in shape reconstruction and contact fidelity. FSGlove's open-source hardware and software design ensures compatibility with current VR and robotics ecosystems, while its ability to capture subtle motions (e.g., fingertip rubbing) bridges the gap between human dexterity and robotic imitation. Evaluated against Nokov optical MoCap, FSGlove advances hand tracking by unifying the kinematic and contact fidelity. Hardware design, software, and more results are available at: https://sites.google.com/view/fsglove.
Chinese: FSGlove是一种先进的基于惯性的手部动作捕捉系统,通过新型标定方法同时追踪高达48个自由度并重建个性化手部形状,在运动精度和接触保真度方面均超越商业产品,实现了最先进的性能。
English: FSGlove is an advanced inertial-based hand motion capture system that tracks up to 48 degrees of freedom and reconstructs personalized hand shapes using a novel calibration method, achieving high accuracy and outperforming commercial alternatives in both kinematic and contact fidelity.

Authors:Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu
Title: Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution
Abstract:
Recent years, multimodal models have made remarkable strides and pave the way for intelligent browser use agents. However, when solving tasks on real world webpages in multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting the erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, and abstracts them into a unified notion of generalized tools, either expressed as hints or as rule-based codes, and register to the tool archive in real time. The Action Team reinference the process empowered with these targeting tools, thus establishing a closed-loop training pipeline of data-tools-action-feedback. Following the 6 level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability on long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.
中文: 本文提出Recon-Act自演进多智能体框架,通过侦察团队分析错误并生成工具,行动团队利用这些工具执行任务,显著提升了在未知网站和长周期任务中的适应性与解决能力。
English: The paper presents Recon-Act, a self-evolving multi-agent framework that enhances web task execution by employing a reconnaissance team to analyze errors and generate tools, and an action team to utilize these tools for improved adaptability and performance on complex tasks.

Authors:Jianbo Zhao, Taiyu Ban, Xiangjie Li, Xingtai Gui, Hangning Zhou, Lei Liu, Hongwei Zhao, Bin Li
Title: Autoregressive End-to-End Planning with Time-Invariant Spatial Alignment and Multi-Objective Policy Refinement
Abstract:
The inherent sequential modeling capabilities of autoregressive models make them a formidable baseline for end-to-end planning in autonomous driving. Nevertheless, their performance is constrained by a spatio-temporal misalignment, as the planner must condition future actions on past sensory data. This creates an inconsistent worldview, limiting the upper bound of performance for an otherwise powerful approach. To address this, we propose a Time-Invariant Spatial Alignment (TISA) module that learns to project initial environmental features into a consistent ego-centric frame for each future time step, effectively correcting the agent's worldview without explicit future scene prediction. In addition, we employ a kinematic action prediction head (i.e., acceleration and yaw rate) to ensure physically feasible trajectories. Finally, we introduce a multi-objective post-training stage using Direct Preference Optimization (DPO) to move beyond pure imitation. Our approach provides targeted feedback on specific driving behaviors, offering a more fine-grained learning signal than the single, overall objective used in standard DPO. Our model achieves a state-of-the-art 89.8 PDMS on the NAVSIM dataset among autoregressive models. The video document is available at https://tisa-dpo-e2e.github.io/.
中文: 提出的时间不变空间对齐模块通过将环境特征投射到一致的自我中心框架中,解决了自动驾驶中的时空错位问题,同时结合运动学动作预测和直接偏好优化来提升轨迹可行性和行为精细度,实现了最先进的性能。
English: The proposed Time-Invariant Spatial Alignment module addresses spatio-temporal misalignment in autonomous driving by projecting environmental features into consistent ego-centric frames, while kinematic action prediction and Direct Preference Optimization enhance trajectory feasibility and behavioral refinement, achieving state-of-the-art performance.

Authors:Yiyuan Pan, Zhe Liu, Hesheng Wang
Title: Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration
Abstract:
Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encode latent task dynamics, are often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.
Chinese Summary: 本研究提出CERMIC框架,通过过滤噪声惊喜信号并基于多智能体上下文动态校准内在好奇心,显著提升了稀疏奖励环境下多智能体探索性能,优于现有最优算法。
English Summary: The study introduces CERMIC, a framework that enhances multi-agent exploration by filtering noisy surprise signals and dynamically calibrating intrinsic curiosity with multi-agent context, significantly outperforming state-of-the-art algorithms in sparse-reward environments.

Authors:Theo Uscidda, Matthew Trager, Michael Kleinman, Aditya Chattopadhyay, Wei Xia, Stefano Soatto
Title: LATTS: Locally Adaptive Test-Time Scaling
Abstract:
One common strategy for improving the performance of Large Language Models (LLMs) on downstream tasks involves using a \emph{verifier model} to either select the best answer from a pool of candidates or to steer the auto-regressive generation process towards better outputs. This class of methods typically results in improved accuracy at the cost of increased computation at test-time, a paradigm known as \emph{test-time scaling}. However, most existing approaches increase computation uniformly across all samples and generation steps, without considering the complexity of individual instances, leading to inefficient resource use. We address this limitation by proposing an approach, called \emph{Locally Adaptive Test-Time Scaling (LATTS)}, that allocates variable compute across generation steps. Specifically, at each generation step, LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. This criterion effectively adjusts the per-step computational effort based on a precise notion of \emph{local difficulty} derived from the verifier model. Empirical results show that LATTS achieves significantly superior accuracy--compute tradeoffs compared to standard verifier-based methods.
中文: LATTS提出了一种局部自适应测试时间扩展方法,根据局部难度动态调整每个生成步骤的计算量,相比均匀计算方法实现了更优的准确率与计算效率的平衡。
English: LATTS introduces a locally adaptive approach to test-time scaling by dynamically adjusting computational effort per generation step based on local difficulty, achieving superior accuracy-compute tradeoffs compared to uniform methods.

Authors:Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin
Title: Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
Abstract:
We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $σ^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $σ^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.
中文: 本文提出了有效跨度维度(ESD)这一新颖的复杂度度量方法,该指标综合信号、谱和噪声因素,证明了其决定极小化极大风险尺度,并通过不同模型验证了特征学习与泛化能力提升的关联性。
English: This paper introduces the effective span dimension (ESD), a novel complexity measure for spectral algorithms that depends on signal, spectrum, and noise, proving it governs minimax risk scaling and connects feature learning to generalization improvements across various models.

Authors:Dongming Huang, Zhifan Li, Yicheng Li, Qian Lin
Title: Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
Abstract:
We study spectral algorithms in the setting where kernels are learned from data. We introduce the effective span dimension (ESD), an alignment-sensitive complexity measure that depends jointly on the signal, spectrum, and noise level $σ^2$. The ESD is well-defined for arbitrary kernels and signals without requiring eigen-decay conditions or source conditions. We prove that for sequence models whose ESD is at most $K$, the minimax excess risk scales as $σ^2 K$. Furthermore, we analyze over-parameterized gradient flow and prove that it can reduce the ESD. This finding establishes a connection between adaptive feature learning and provable improvements in generalization of spectral algorithms. We demonstrate the generality of the ESD framework by extending it to linear models and RKHS regression, and we support the theory with numerical experiments. This framework provides a novel perspective on generalization beyond traditional fixed-kernel theories.
中文: 本文提出了有效跨度维度(ESD)这一新颖的复杂度度量方法,该指标综合信号、谱和噪声因素,证明了其决定极小化极大风险尺度,并通过不同模型验证了特征学习与泛化能力提升的关联性。
English: This paper introduces the effective span dimension (ESD), a novel complexity measure for spectral algorithms that depends on signal, spectrum, and noise, proving it governs minimax risk scaling and connects feature learning to generalization improvements across various models.

Authors:Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, Xianpeng Lang
Title: Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving
Abstract:
End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
中文摘要:ReflectDrive是一种基于学习的新型框架,通过离散扩散集成安全感知的反思机制,无需梯度计算即可进行迭代自我修正,为自动驾驶系统生成安全轨迹。
English Summary: ReflectDrive is a novel learning-based framework that integrates a safety-aware reflection mechanism using discrete diffusion for autonomous driving, enabling iterative self-correction without gradient computation to generate safe trajectories.

Authors:Pei Liu, Hongliang Lu, Haichao Liu, Haipeng Liu, Xin Liu, Ruoyu Yao, Shengbo Eben Li, Jun Ma
Title: OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving
Abstract:
Human vision is capable of transforming two-dimensional observations into an egocentric three-dimensional scene understanding, which underpins the ability to translate complex scenes and exhibit adaptive behaviors. This capability, however, remains lacking in current autonomous driving systems, where mainstream approaches primarily rely on depth-based 3D reconstruction rather than true scene understanding. To address this limitation, we propose a novel human-like framework called OmniScene. First, we introduce the OmniScene Vision-Language Model (OmniVLM), a vision-language framework that integrates multi-view and temporal perception for holistic 4D scene understanding. Then, harnessing a teacher-student OmniVLM architecture and knowledge distillation, we embed textual representations into 3D instance features for semantic supervision, enriching feature learning, and explicitly capturing human-like attentional semantics. These feature representations are further aligned with human driving behaviors, forming a more human-like perception-understanding-action architecture. In addition, we propose a Hierarchical Fusion Strategy (HFS) to address imbalances in modality contributions during multimodal integration. Our approach adaptively calibrates the relative significance of geometric and semantic features at multiple abstraction levels, enabling the synergistic use of complementary cues from visual and textual modalities. This learnable dynamic fusion enables a more nuanced and effective exploitation of heterogeneous information. We evaluate OmniScene comprehensively on the nuScenes dataset, benchmarking it against over ten state-of-the-art models across various tasks. Our approach consistently achieves superior results, establishing new benchmarks in perception, prediction, planning, and visual question answering.
Chinese: 提出的OmniScene框架通过融合多视角视觉语言建模与知识蒸馏,实现了类人的四维场景理解,在感知、预测和规划等任务中全面超越现有先进模型,推动自动驾驶系统发展。
English: The proposed OmniScene framework advances autonomous driving by integrating multi-view vision-language modeling and knowledge distillation to achieve human-like 4D scene understanding, outperforming existing models across perception, prediction, and planning tasks.

Authors:Sheng-Bin Duan, Jian-Long Hao, Tian-Yu Xiang, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Zeng-Guang Hou
Title: Online Adaptation via Dual-Stage Alignment and Self-Supervision for Fast-Calibration Brain-Computer Interfaces
Abstract:
Individual differences in brain activity hinder the online application of electroencephalogram (EEG)-based brain computer interface (BCI) systems. To overcome this limitation, this study proposes an online adaptation algorithm for unseen subjects via dual-stage alignment and self-supervision. The alignment process begins by applying Euclidean alignment in the EEG data space and then updates batch normalization statistics in the representation space. Moreover, a self-supervised loss is designed to update the decoder. The loss is computed by soft pseudo-labels derived from the decoder as a proxy for the unknown ground truth, and is calibrated by Shannon entropy to facilitate self-supervised training. Experiments across five public datasets and seven decoders show the proposed algorithm can be integrated seamlessly regardless of BCI paradigm and decoder architecture. In each iteration, the decoder is updated with a single online trial, which yields average accuracy gains of 4.9% on steady-state visual evoked potentials (SSVEP) and 3.6% on motor imagery. These results support fast-calibration operation and show that the proposed algorithm has great potential for BCI applications.
Chinese: 本研究提出一种基于双阶段对齐和自监督的在线自适应算法,用于脑机接口系统,通过单次在线试验更新解码器,在稳态视觉诱发电位和运动想象任务中分别实现4.9%和3.6%的平均准确率提升,支持快速校准操作。
English: This study introduces an online adaptation algorithm for EEG-based BCIs that uses dual-stage alignment and self-supervision to improve performance across subjects, achieving significant accuracy gains in SSVEP and motor imagery tasks with fast calibration.

Authors:Dennis Gross, Helge Spieker, Arnaud Gotlieb
Title: Bounded PCTL Model Checking of Large Language Model Outputs
Abstract:
In this paper, we introduce LLMCHECKER, a model-checking-based verification method to verify the probabilistic computation tree logic (PCTL) properties of an LLM text generation process. We empirically show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of $α$-$k$-bounded text generation, narrowing the focus to the $α$ maximal cumulative probability on the top-$k$ tokens at every step of the text generation process. Our verification method considers an initial string and the subsequent top-$k$ tokens while accommodating diverse text quantification methods, such as evaluating text quality and biases. The threshold $α$ further reduces the selected tokens, only choosing those that exceed or meet it in cumulative probability. LLMCHECKER then allows us to formally verify the PCTL properties of $α$-$k$-bounded LLMs. We demonstrate the applicability of our method in several LLMs, including Llama, Gemma, Mistral, Genstruct, and BERT. To our knowledge, this is the first time PCTL-based model checking has been used to check the consistency of the LLM text generation process.
Chinese: 本文提出LLMCHECKER,一种基于模型检测的方法,通过α-k有界生成策略验证LLM文本生成过程的PCTL属性,该策略将每个步骤的候选标记限定在累积概率超过阈值α的前k个标记范围内。
English: This paper presents LLMCHECKER, a model-checking method that verifies PCTL properties in LLM text generation by focusing on α-k-bounded generation, which limits token selection to the top-k tokens exceeding a probability threshold α.

Authors:Zexun Zhan, Shuzheng Gao, Ruida Hu, Cuiyun Gao
Title: SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement
Abstract:
Large language models (LLMs) have achieved remarkable progress in code generation. However, existing benchmarks mainly formalize the task as a static, single-turn problem, overlooking the stepwise requirement changes and iterative workflows in real-world software development. This mismatch limits the understanding of how well LLMs can support real-world development workflows. Constructing such iterative benchmarks is challenging due to the lack of public interaction traces and the difficulty of creating discriminative, turn-specific test cases. To bridge this gap, we present SR-Eval, a benchmark specifically designed to assess LLMs on iterative code generation under Stepwise requirements Refinement. SR-Eval spans both function-level and repository-level tasks in Python and Java, enabling fine-grained and progressive evaluation across evolving requirements. The construction of SR-Eval follows a carefully designed pipeline that first leverages a multi-agent-based requirement generation method to simulate the development process and recover the multi-round interaction process from final requirements, then employs a semantic-aware discriminative test case generation component to ensure discriminative and consistent evaluation at each turn. SR-Eval comprises 443 multi-turn tasks and 1,857 questions at both function and repository levels. Using SR-Eval, we evaluate 11 representative LLMs with three prompting strategies that simulate different usage patterns. Results show that iterative code generation under stepwise requirement refinement remains highly challenging: the best-performing model achieves only 22.67% completion rate on function-level tasks and 20.00% on repository-level tasks. We further observe that prompting strategies substantially influence performance, highlighting the need for the development of advanced methods.
中文摘要:SR-Eval是一个专门评估大语言模型在逐步需求细化下迭代代码生成能力的新基准,结果表明即使采用不同提示策略,当前最佳模型的完成率仍低于23%,突显了现实开发流程中该任务的高度挑战性。
English Summary: SR-Eval is a novel benchmark designed to evaluate large language models on iterative code generation with stepwise requirement refinements, revealing that current models achieve low completion rates of under 23% despite different prompting strategies, highlighting the challenge in real-world development workflows.

Authors:Sabri Boughorbel, Fahim Dalvi, Nadir Durrani, Majd Hawasly
Title: Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing
Abstract:
As fine-tuning becomes the dominant paradigm for improving large language models (LLMs), understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain why one model outperforms another. In this work, we use model diffing, a mechanistic interpretability approach, to analyze the specific capability differences between Gemma-2-9b-it and a SimPO-enhanced variant. Using crosscoders, we identify and categorize latent representations that differentiate the two models. We find that SimPO acquired latent concepts predominantly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while its additional training also reduces emphasis on model self-reference (-44.1%) and hallucination management (-68.5%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.
Chinese: 模型差异分析显示,经SimPO增强的Gemma-2-9b-it模型在安全机制、多语言能力和指令遵循方面显著提升,同时降低了模型自引用和幻觉管理,为超越传统基准的LLM比较提供了透明框架。
English: Model diffing reveals that SimPO-enhanced Gemma-2-9b-it significantly improves safety, multilingual ability, and instruction-following while reducing self-reference and hallucination management, offering a transparent framework for comparing LLMs beyond traditional benchmarks.

Authors:Chun Kit Wong, Anders N. Christensen, Cosmin I. Bercea, Julia A. Schnabel, Martin G. Tolsgaard, Aasa Feragen
Title: Influence of Classification Task and Distribution Shift Type on OOD Detection in Fetal Ultrasound
Abstract:
Reliable out-of-distribution (OOD) detection is important for safe deployment of deep learning models in fetal ultrasound amidst heterogeneous image characteristics and clinical settings. OOD detection relies on estimating a classification model's uncertainty, which should increase for OOD samples. While existing research has largely focused on uncertainty quantification methods, this work investigates the impact of the classification task itself. Through experiments with eight uncertainty quantification methods across four classification tasks, we demonstrate that OOD detection performance significantly varies with the task, and that the best task depends on the defined ID-OOD criteria; specifically, whether the OOD sample is due to: i) an image characteristic shift or ii) an anatomical feature shift. Furthermore, we reveal that superior OOD detection does not guarantee optimal abstained prediction, underscoring the necessity to align task selection and uncertainty strategies with the specific downstream application in medical image analysis.
中文: 本研究证明胎儿超声的分布外检测性能随分类任务显著变化,且取决于分布偏移源自图像特征还是解剖结构,同时揭示最优检测不能保证有效的弃权预测。
English: This study demonstrates that out-of-distribution detection performance in fetal ultrasound varies significantly with classification tasks and depends on whether distribution shifts stem from image characteristics or anatomical features, while also revealing that optimal detection doesn't ensure effective abstained prediction.

Authors:Guizhen Chen, Weiwen Xu, Hao Zhang, Hou Pong Chan, Deli Zhao, Anh Tuan Luu, Yu Rong
Title: GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric Reasoning
Abstract:
Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
中文摘要:本研究针对多模态大语言模型在视觉感知上的瓶颈,提出了两阶段强化学习框架,先增强几何结构的视觉感知能力再培养推理能力,在几何任务上取得显著提升,并证明该方法可推广至其他视觉密集型领域。
English Summary: This study addresses the perceptual bottleneck in multimodal large language models (MLLMs) by proposing a two-stage reinforcement learning framework that first enhances visual perception of geometric structures before developing reasoning capabilities, achieving significant improvements in geometric tasks and demonstrating broader applicability to vision-intensive domains.

Authors:Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li
Title: Attention Consistency for LLMs Explanation
Abstract:
Understanding the decision-making processes of large language models (LLMs) is essential for their trustworthy development and deployment. However, current interpretability methods often face challenges such as low resolution and high computational cost. To address these limitations, we propose the \textbf{Multi-Layer Attention Consistency Score (MACS)}, a novel, lightweight, and easily deployable heuristic for estimating the importance of input tokens in decoder-based models. MACS measures contributions of input tokens based on the consistency of maximal attention. Empirical evaluations demonstrate that MACS achieves a favorable trade-off between interpretability quality and computational efficiency, showing faithfulness comparable to complex techniques with a 22\% decrease in VRAM usage and 30\% reduction in latency.
中文摘要:多层注意力一致性评分(MACS)通过测量最大注意力一致性来评估输入词元重要性,这种轻量级方法在保持解释质量的同时显著降低了计算成本。
English Summary: The Multi-Layer Attention Consistency Score (MACS) is a lightweight method that evaluates input token importance by measuring maximal attention consistency, offering comparable interpretability quality with significantly reduced computational costs.

Authors:Junhyeok Lee, Helin Wang, Yaohan Guan, Thomas Thebaud, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak
Title: MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
Abstract:
We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to enhance intellgibility and speaker similarity, and can use or omit pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at https://maskvct.github.io/.
中文: MaskVCT是一种零样本语音转换模型,通过无分类器引导实现多因素控制,能够在保持最佳说话人相似度的同时,平衡语音内容与韵律特征,并具有竞争力的语音识别准确率。
English: MaskVCT is a zero-shot voice conversion model that enables multi-factor control through classifier-free guidance, allowing users to balance speaker identity, linguistic content, and prosody while achieving superior speaker similarity and competitive intelligibility.

Authors:Liang Heng, Jiadong Xu, Yiwen Wang, Xiaoqi Li, Muhe Cai, Yan Shen, Juan Zhu, Guanghui Ren, Hao Dong
Title: Imagine2Act: Leveraging Object-Action Motion Consistency from Imagined Goals for Robotic Manipulation
Abstract:
Relational object rearrangement (ROR) tasks (e.g., insert flower to vase) require a robot to manipulate objects with precise semantic and geometric reasoning. Existing approaches either rely on pre-collected demonstrations that struggle to capture complex geometric constraints or generate goal-state observations to capture semantic and geometric knowledge, but fail to explicitly couple object transformation with action prediction, resulting in errors due to generative noise. To address these limitations, we propose Imagine2Act, a 3D imitation-learning framework that incorporates semantic and geometric constraints of objects into policy learning to tackle high-precision manipulation tasks. We first generate imagined goal images conditioned on language instructions and reconstruct corresponding 3D point clouds to provide robust semantic and geometric priors. These imagined goal point clouds serve as additional inputs to the policy model, while an object-action consistency strategy with soft pose supervision explicitly aligns predicted end-effector motion with generated object transformation. This design enables Imagine2Act to reason about semantic and geometric relationships between objects and predict accurate actions across diverse tasks. Experiments in both simulation and the real world demonstrate that Imagine2Act outperforms previous state-of-the-art policies. More visualizations can be found at https://sites.google.com/view/imagine2act.
Chinese: Imagine2Act是一种新颖的3D模仿学习框架,它将语义和几何约束融入策略学习,通过生成想象的目标点云并协调物体变换与动作预测,使机器人能够执行高精度的关系性物体重排任务。
English: Imagine2Act is a novel 3D imitation-learning framework that integrates semantic and geometric constraints into policy learning, enabling robots to perform high-precision relational object rearrangement tasks by generating imagined goal point clouds and aligning object transformations with action predictions.

Authors:Zhao Song, Jianfei Xue, Lichen Zhang
Title: Differential Privacy for Euclidean Jordan Algebra with Applications to Private Symmetric Cone Programming
Abstract:
In this paper, we study differentially private mechanisms for functions whose outputs lie in a Euclidean Jordan algebra. Euclidean Jordan algebras capture many important mathematical structures and form the foundation of linear programming, second-order cone programming, and semidefinite programming. Our main contribution is a generic Gaussian mechanism for such functions, with sensitivity measured in $\ell_2$, $\ell_1$, and $\ell_\infty$ norms. Notably, this framework includes the important case where the function outputs are symmetric matrices, and sensitivity is measured in the Frobenius, nuclear, or spectral norm. We further derive private algorithms for solving symmetric cone programs under various settings, using a combination of the multiplicative weights update method and our generic Gaussian mechanism. As an application, we present differentially private algorithms for semidefinite programming, resolving a major open question posed by [Hsu, Roth, Roughgarden, and Ullman, ICALP 2014].
中文: 本文针对输出在欧几里得约当代数中的函数提出了一种通用的高斯机制,通过结合乘性权重更新方法,开发了对称锥规划的差分隐私算法,并解决了半定规划隐私保护中的关键开放性问题。
English: This paper introduces a generic Gaussian mechanism for differentially private functions with outputs in Euclidean Jordan algebras, enabling private algorithms for symmetric cone programs and resolving an open question in private semidefinite programming.

Authors:Zhao Song, David P. Woodruff, Lichen Zhang
Title: Sublinear Time Quantum Sensitivity Sampling
Abstract:
We present a unified framework for quantum sensitivity sampling, extending the advantages of quantum computing to a broad class of classical approximation problems. Our unified framework provides a streamlined approach for constructing coresets and offers significant runtime improvements in applications such as clustering, regression, and low-rank approximation. Our contributions include: * $k$-median and $k$-means clustering: For $n$ points in $d$-dimensional Euclidean space, we give an algorithm that constructs an $ε$-coreset in time $\widetilde O(n^{0.5}dk^{2.5}~\mathrm{poly}(ε^{-1}))$ for $k$-median and $k$-means clustering. Our approach achieves a better dependence on $d$ and constructs smaller coresets that only consist of points in the dataset, compared to recent results of [Xue, Chen, Li and Jiang, ICML'23]. * $\ell_p$ regression: For $\ell_p$ regression problems, we construct an $ε$-coreset of size $\widetilde O_p(d^{\max\{1, p/2\}}ε^{-2})$ in time $\widetilde O_p(n^{0.5}d^{\max\{0.5, p/4\}+1}(ε^{-3}+d^{0.5}))$, improving upon the prior best quantum sampling approach of [Apers and Gribling, QIP'24] for all $p\in (0, 2)\cup (2, 22]$, including the widely studied least absolute deviation regression ($\ell_1$ regression). * Low-rank approximation with Frobenius norm error: We introduce the first quantum sublinear-time algorithm for low-rank approximation that does not rely on data-dependent parameters, and runs in $\widetilde O(nd^{0.5}k^{0.5}ε^{-1})$ time. Additionally, we present quantum sublinear algorithms for kernel low-rank approximation and tensor low-rank approximation, broadening the range of achievable sublinear time algorithms in randomized numerical linear algebra.
Chinese: 本研究提出了统一的量子敏感度采样框架,显著加速了聚类、回归和低秩逼近等问题的核心集构建,相比现有方法在计算效率和核心集规模方面均实现了重要突破。
English: This study introduces a unified quantum sensitivity sampling framework that significantly accelerates coreset construction for clustering, regression, and low-rank approximation problems while achieving improved computational efficiency and smaller coreset sizes compared to prior methods.

Authors:Tianyang Xu, Hongqiu Wu, Weiqi Wu, Hai Zhao
Title: OPEN-THEATRE: An Open-Source Toolkit for LLM-based Interactive Drama
Abstract:
LLM-based Interactive Drama introduces a novel dialogue scenario in which the player immerses into a character and engages in a dramatic story by interacting with LLM agents. Despite the fact that this emerging area holds significant promise, it remains largely underexplored due to the lack of a well-designed playground to develop a complete drama. This makes a significant barrier for researchers to replicate, extend, and study such systems. Hence, we present Open-Theatre, the first open-source toolkit for experiencing and customizing LLM-based interactive drama. It refines prior work with an efficient multi-agent architecture and a hierarchical retrieval-based memory system, designed to enhance narrative coherence and realistic long-term behavior in complex interactions. In addition, we provide a highly configurable pipeline, making it easy for researchers to develop and optimize new approaches.
中文: Open-Theatre是首个开源工具包,通过高效的多智能体架构和分层记忆系统来增强基于大语言模型的互动戏剧的叙事连贯性和长期行为真实性,并为研究人员提供高度可配置的开发流程。
English: Open-Theatre is the first open-source toolkit designed to enhance LLM-based interactive drama by improving narrative coherence and long-term behavior through an efficient multi-agent architecture and hierarchical memory system, while offering a configurable pipeline for researchers.

Authors:Janak Kapuriya, Anwar Shaikh, Arnav Goel, Medha Hira, Apoorv Singh, Jay Saraf, Sanjana, Vaibhav Nauriyal, Avinash Anand, Zhengkui Wang, Rajiv Ratn Shah
Title: Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
Abstract:
In this study, we introduce Vision-Caption aware Supervised FineTuning (VCASFT), a novel learning paradigm designed to enhance the performance of smaller Vision Language Models(VLMs) on scientific visual question answering(VQA) tasks. VCASFT leverages image captions as zero-shot prompts alongside question-answer pairs and instruction-tunes models to yield significant performance improvements. To comprehensively evaluate VCASFT, we benchmark it on ScienceQA, which consists of questions across diverse languages, subjects, and fields, demonstrating its adaptability and effectiveness in a variety of educational contexts. Additionally, to further demonstrate the effectiveness of this technique on lowresource languages, we developed HiSciVQA, a dataset comprising 2,245 high-quality, hand-annotated Hindi multimodal Q&A pairs. This dataset addresses the critical need for low-resource language Q&A datasets and serves as a foundation for testing VCASFT. Additionally, we introduce a novel LLM-based evaluation scheme to evaluate VLMs on HiSciVQA which offers deeper insights into model effectiveness surpassing traditional n-gram matching accuracy metrics. We are committed to advancing the field by open-sourcing all code files and the HiSciVQA dataset for the research community.
中文摘要:本研究提出VCASFT这一新型微调范式,通过利用图像描述作为提示来提升小型视觉语言模型在科学视觉问答任务中的表现,并通过ScienceQA基准测试和全新开发的印地语数据集HiSciVQA及其创新评估方案验证了其有效性。
English Summary: This study introduces VCASFT, a novel fine-tuning method that enhances smaller Vision Language Models' performance on scientific VQA tasks by leveraging image captions as prompts, and validates its effectiveness through ScienceQA benchmarks and a newly developed Hindi dataset HiSciVQA with advanced evaluation metrics.

Authors:Gabrielle Chavez, Laureano Moro-Velazquez, Ankur Butala, Najim Dehak, Thomas Thebaud
Title: Cross-Corpus and Cross-domain Handwriting Assessment of NeuroDegenerative Diseases via Time-Series-to-Image Conversion
Abstract:
Handwriting is significantly affected by neurological disorders (ND) such as Parkinson's disease (PD) and Alzheimer's disease (AD). Prior works have analyzed handwriting tasks using feature-based approaches or computer-vision techniques, but these methods have struggled to generalize across multiple datasets, particularly between temporal features represented as time-series and images. We propose a framework that leverages both time-series and images of handwriting through a joint classifier, based on a ResNet50 pretrained on ImageNet-1k. Binary classification experiments demonstrate state-of-the-art performances on existing time-series and image datasets, with significant improvement on specific drawing and writing tasks from the NeuroLogical Signals (NLS) dataset. In particular, the proposed model demonstrates improved performance on Draw Clock and Spiral tasks. Additionally, cross-dataset and multi-dataset experiments were consistently able to achieve high F1 scores, up to 98 for PD detection, highlighting the potential of the proposed model to generalize over different forms of handwriting signals, and enhance the detection of motor deficits in ND.
中文摘要:本研究提出了一种基于ResNet50的联合分类框架,通过同时分析时间序列和图像笔迹数据,在帕金森病和阿尔茨海默病等神经系统疾病检测中实现了最优性能,并展现出卓越的跨数据集泛化能力。
English Summary: This study introduces a joint classification framework using ResNet50 to analyze both time-series and image-based handwriting data, achieving state-of-the-art performance in detecting neurological disorders like Parkinson's and Alzheimer's with high generalization across datasets.

Authors:Ritvik Singh, Karl Van Wyk, Pieter Abbeel, Jitendra Malik, Nathan Ratliff, Ankur Handa
Title: End-to-end RL Improves Dexterous Grasping Policies
Abstract:
This work explores techniques to scale up image-based end-to-end learning for dexterous grasping with an arm + hand system. Unlike state-based RL, vision-based RL is much more memory inefficient, resulting in relatively low batch sizes, which is not amenable for algorithms like PPO. Nevertheless, it is still an attractive method as unlike the more commonly used techniques which distill state-based policies into vision networks, end-to-end RL can allow for emergent active vision behaviors. We identify a key bottleneck in training these policies is the way most existing simulators scale to multiple GPUs using traditional data parallelism techniques. We propose a new method where we disaggregate the simulator and RL (both training and experience buffers) onto separate GPUs. On a node with four GPUs, we have the simulator running on three of them, and PPO running on the fourth. We are able to show that with the same number of GPUs, we can double the number of existing environments compared to the previous baseline of standard data parallelism. This allows us to train vision-based environments, end-to-end with depth, which were previously performing far worse with the baseline. We train and distill both depth and state-based policies into stereo RGB networks and show that depth distillation leads to better results, both in simulation and reality. This improvement is likely due to the observability gap between state and vision policies which does not exist when distilling depth policies into stereo RGB. We further show that the increased batch size brought about by disaggregated simulation also improves real world performance. When deploying in the real world, we improve upon the previous state-of-the-art vision-based results using our end-to-end policies.
中文: 本研究提出一种方法,通过将模拟器和强化学习过程分离到不同的GPU上,提升了基于视觉的灵巧抓取强化学习的效率,使环境数量翻倍并改善了训练效果,最终通过将深度策略蒸馏到立体RGB网络中,在仿真和现实中均取得了更优的性能。
English: This study introduces a method to enhance vision-based reinforcement learning for dexterous grasping by disaggregating the simulator and RL processes across separate GPUs, which doubles the number of environments and improves training efficiency, leading to superior real-world performance and better results through depth distillation into stereo RGB networks.

Authors:Lester Phillip Violeta, Xueyao Zhang, Jiatong Shi, Yusuke Yasuda, Wen-Chin Huang, Zhizheng Wu, Tomoki Toda
Title: The Singing Voice Conversion Challenge 2025: From Singer Identity Conversion To Singing Style Conversion
Abstract:
We present the findings of the latest iteration of the Singing Voice Conversion Challenge, a scientific event aiming to compare and understand different voice conversion systems in a controlled environment. Compared to previous iterations which solely focused on converting the singer identity, this year we also focused on converting the singing style of the singer. To create a controlled environment and thorough evaluations, we developed a new challenge database, introduced two tasks, open-sourced baselines, and conducted large-scale crowd-sourced listening tests and objective evaluations. The challenge was ran for two months and in total we evaluated 26 different systems. The results of the large-scale crowd-sourced listening test showed that top systems had comparable singer identity scores to ground truth samples. However, modeling the singing style and consequently achieving high naturalness still remains a challenge in this task, primarily due to the difficulty in modeling dynamic information in breathy, glissando, and vibrato singing styles.
中文: 最新一届歌声转换挑战赛不仅关注歌手身份转换,还引入了歌唱风格转换,通过大规模评估发现顶尖系统在身份还原上接近真实样本,但因难以模拟气息、滑音和颤音等动态特征,实现高自然度仍具挑战。
English: The latest Singing Voice Conversion Challenge expanded its focus to include both singer identity and singing style conversion, evaluating 26 systems through large-scale tests that revealed top performers matched ground truth in identity but struggled with naturalness due to difficulties in modeling dynamic vocal elements like breathiness and vibrato.

Authors:Ruiqi Wang, Dezhong Zhao, Ziqin Yuan, Tianyu Shao, Guohua Chen, Dominic Kao, Sungeun Hong, Byung-Cheol Min
Title: PRIMT: Preference-based Reinforcement Learning with Multimodal Feedback and Trajectory Synthesis from Foundation Models
Abstract:
Preference-based reinforcement learning (PbRL) has emerged as a promising paradigm for teaching robots complex behaviors without reward engineering. However, its effectiveness is often limited by two critical challenges: the reliance on extensive human input and the inherent difficulties in resolving query ambiguity and credit assignment during reward learning. In this paper, we introduce PRIMT, a PbRL framework designed to overcome these challenges by leveraging foundation models (FMs) for multimodal synthetic feedback and trajectory synthesis. Unlike prior approaches that rely on single-modality FM evaluations, PRIMT employs a hierarchical neuro-symbolic fusion strategy, integrating the complementary strengths of large language models and vision-language models in evaluating robot behaviors for more reliable and comprehensive feedback. PRIMT also incorporates foresight trajectory generation, which reduces early-stage query ambiguity by warm-starting the trajectory buffer with bootstrapped samples, and hindsight trajectory augmentation, which enables counterfactual reasoning with a causal auxiliary loss to improve credit assignment. We evaluate PRIMT on 2 locomotion and 6 manipulation tasks on various benchmarks, demonstrating superior performance over FM-based and scripted baselines.
中文:PRIMT是一种基于偏好的强化学习框架,利用基础模型提供多模态反馈和轨迹合成,通过分层神经符号融合策略减少人工输入,并借助前瞻轨迹生成和后顾轨迹增强解决查询模糊性和信用分配问题。
English: PRIMT is a preference-based reinforcement learning framework that leverages foundation models for multimodal feedback and trajectory synthesis to reduce human input and improve reward learning by addressing query ambiguity and credit assignment.

Authors:Shen Chen, Ruiyu Zhao, Jiale Zhou, Zongkai Wu, Jenq-Neng Hwang, Lei Li
Title: Causal Reasoning Elicits Controllable 3D Scene Generation
Abstract:
Existing 3D scene generation methods often struggle to model the complex logical dependencies and physical constraints between objects, limiting their ability to adapt to dynamic and realistic environments. We propose CausalStruct, a novel framework that embeds causal reasoning into 3D scene generation. Utilizing large language models (LLMs), We construct causal graphs where nodes represent objects and attributes, while edges encode causal dependencies and physical constraints. CausalStruct iteratively refines the scene layout by enforcing causal order to determine the placement order of objects and applies causal intervention to adjust the spatial configuration according to physics-driven constraints, ensuring consistency with textual descriptions and real-world dynamics. The refined scene causal graph informs subsequent optimization steps, employing a Proportional-Integral-Derivative(PID) controller to iteratively tune object scales and positions. Our method uses text or images to guide object placement and layout in 3D scenes, with 3D Gaussian Splatting and Score Distillation Sampling improving shape accuracy and rendering stability. Extensive experiments show that CausalStruct generates 3D scenes with enhanced logical coherence, realistic spatial interactions, and robust adaptability.
中文: CausalStruct通过将因果推理与大型语言模型相结合,迭代优化布局并强化物理约束,从而生成具有更强逻辑连贯性、真实空间交互和鲁棒适应性的3D场景。
English: CausalStruct integrates causal reasoning with large language models to generate 3D scenes that exhibit improved logical consistency, realistic object interactions, and adaptability by iteratively refining layouts and enforcing physical constraints.

Authors:Ju Dong, Lei Zhang, Liding Zhang, Yao Ling, Yu Fu, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang
Title: M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation
Abstract:
Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments due to limited fields of view, exploration, and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy leverages proprioceptive states and complementary camera perspectives with both close-range object details and global scene context to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 7 to 56 percent higher success rates and reduces collisions by 3 to 31 percent over baselines. Our approach demonstrates robust performance for smooth whole-body coordination, and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/m4diffuser.
中文: M4Diffuser是一种混合框架,通过多视角扩散策略生成任务相关的末端执行器目标,并采用新型简化且可操作性感知的QP控制器进行稳健执行,在移动操作任务中实现了显著更高的成功率和更低的碰撞率。
English: M4Diffuser is a hybrid framework combining a Multi-View Diffusion Policy for generating task-relevant end-effector goals with a novel Reduced and Manipulability-aware QP controller for robust execution, achieving significantly higher success rates and reduced collisions in mobile manipulation tasks.

Authors:Jiabo MA, Wenqiang Li, Jinbang Li, Ziyi Liu, Linshan Wu, Fengtao Zhou, Li Liang, Ronald Cheong Kin Chan, Terence T. W. Wong, Hao Chen
Title: Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows
Abstract:
Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face significant challenges in clinical applications, primarily due to their reliance on well-aligned paired data. Obtaining such data is inherently difficult because chemical staining processes can distort tissue structures, and a single tissue section cannot undergo multiple staining procedures without damage or loss of information. As a result, most available virtual staining datasets are either unpaired or roughly paired, making it difficult for existing methods to achieve accurate pixel-level supervision. To address this challenge, we propose a robust virtual staining framework featuring cascaded registration mechanisms to resolve spatial mismatches between generated outputs and their corresponding ground truth. Experimental results demonstrate that our method significantly outperforms state-of-the-art models across five datasets, achieving an average improvement of 3.2% on internal datasets and 10.1% on external datasets. Moreover, in datasets with substantial misalignment, our approach achieves a remarkable 23.8% improvement in peak signal-to-noise ratio compared to baseline models. The exceptional robustness of the proposed method across diverse datasets simplifies the data acquisition process for virtual staining and offers new insights for advancing its development.
中文: 该研究提出的具有级联配准机制的鲁棒性虚拟染色框架解决了非配对数据中的空间错位问题,在多个数据集上实现显著性能提升,为临床应用简化了数据采集流程。
English: The proposed robust virtual staining framework with cascaded registration mechanisms overcomes spatial mismatch challenges in unpaired datasets, achieving significant performance improvements across multiple datasets and simplifying data acquisition for clinical applications.

Authors:Junan Zhang, Yunjia Zhang, Xueyao Zhang, Zhizheng Wu
Title: AnyAccomp: Generalizable Accompaniment Generation via Quantized Melodic Bottleneck
Abstract:
Singing Accompaniment Generation (SAG) is the process of generating instrumental music for a given clean vocal input. However, existing SAG techniques use source-separated vocals as input and overfit to separation artifacts. This creates a critical train-test mismatch, leading to failure on clean, real-world vocal inputs. We introduce AnyAccomp, a framework that resolves this by decoupling accompaniment generation from source-dependent artifacts. AnyAccomp first employs a quantized melodic bottleneck, using a chromagram and a VQ-VAE to extract a discrete and timbre-invariant representation of the core melody. A subsequent flow-matching model then generates the accompaniment conditioned on these robust codes. Experiments show AnyAccomp achieves competitive performance on separated-vocal benchmarks while significantly outperforming baselines on generalization test sets of clean studio vocals and, notably, solo instrumental tracks. This demonstrates a qualitative leap in generalization, enabling robust accompaniment for instruments - a task where existing models completely fail - and paving the way for more versatile music co-creation tools. Demo audio and code: https://anyaccomp.github.io
中文:AnyAccomp框架通过解耦旋律提取与音源伪影,突破了现有歌唱伴奏生成技术的局限,不仅在分离人声基准测试中表现优异,更在纯净录音和器乐独奏的泛化测试中实现质的飞跃,首次实现了对器乐伴奏的稳健生成。
English: AnyAccomp is a novel framework that overcomes the limitations of existing singing accompaniment generation methods by decoupling melody extraction from source artifacts, enabling robust performance on both separated vocals and clean studio recordings while uniquely extending to instrumental tracks.

Authors:Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Huaping Liu, He Wang, Li Yi
Title: Track Any Motions under Any Disturbances
Abstract:
A foundational humanoid motion tracker is expected to be able to track diverse, highly dynamic, and contact-rich motions. More importantly, it needs to operate stably in real-world scenarios against various dynamics disturbances, including terrains, external forces, and physical property changes for general practical use. To achieve this goal, we propose Any2Track (Track Any motions under Any disturbances), a two-stage RL framework to track various motions under multiple disturbances in the real world. Any2Track reformulates dynamics adaptability as an additional capability on top of basic action execution and consists of two key components: AnyTracker and AnyAdapter. AnyTracker is a general motion tracker with a series of careful designs to track various motions within a single policy. AnyAdapter is a history-informed adaptation module that endows the tracker with online dynamics adaptability to overcome the sim2real gap and multiple real-world disturbances. We deploy Any2Track on Unitree G1 hardware and achieve a successful sim2real transfer in a zero-shot manner. Any2Track performs exceptionally well in tracking various motions under multiple real-world disturbances.
中文:Any2Track是一个两阶段强化学习框架,通过其核心组件AnyTracker实现通用运动追踪和AnyAdapter提供动态适应能力,能够在真实世界多种干扰下稳定追踪人形机器人动作,并在硬件上实现了零样本的仿真到现实迁移。
English: Any2Track is a two-stage reinforcement learning framework designed to track diverse humanoid motions under real-world disturbances through its dual components, AnyTracker for general motion tracking and AnyAdapter for dynamic adaptability, achieving successful zero-shot sim-to-real transfer on hardware.

Authors:Zhikai Zhang, Jun Guo, Chao Chen, Jilong Wang, Chenghuai Lin, Yunrui Lian, Han Xue, Zhenrong Wang, Maoqi Liu, Jiangran Lyu, Huaping Liu, He Wang, Li Yi
Title: Track Any Motions under Any Disturbances
Abstract:
A foundational humanoid motion tracker is expected to be able to track diverse, highly dynamic, and contact-rich motions. More importantly, it needs to operate stably in real-world scenarios against various dynamics disturbances, including terrains, external forces, and physical property changes for general practical use. To achieve this goal, we propose Any2Track (Track Any motions under Any disturbances), a two-stage RL framework to track various motions under multiple disturbances in the real world. Any2Track reformulates dynamics adaptability as an additional capability on top of basic action execution and consists of two key components: AnyTracker and AnyAdapter. AnyTracker is a general motion tracker with a series of careful designs to track various motions within a single policy. AnyAdapter is a history-informed adaptation module that endows the tracker with online dynamics adaptability to overcome the sim2real gap and multiple real-world disturbances. We deploy Any2Track on Unitree G1 hardware and achieve a successful sim2real transfer in a zero-shot manner. Any2Track performs exceptionally well in tracking various motions under multiple real-world disturbances.
中文:Any2Track是一个两阶段强化学习框架,通过其核心组件AnyTracker实现通用运动追踪和AnyAdapter提供动态适应能力,能够在真实世界多种干扰下稳定追踪人形机器人动作,并在硬件上实现了零样本的仿真到现实迁移。
English: Any2Track is a two-stage reinforcement learning framework designed to track diverse humanoid motions under real-world disturbances through its dual components, AnyTracker for general motion tracking and AnyAdapter for dynamic adaptability, achieving successful zero-shot sim-to-real transfer on hardware.

Authors:Yunchuan Guan, Yu Liu, Ke Zhou, Zhiqi Shen, Jenq-Neng Hwang, Serge Belongie, Lei Li
Title: Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
Abstract:
Meta-learning is a powerful paradigm for tackling few-shot tasks. However, recent studies indicate that models trained with the whole-class training strategy can achieve comparable performance to those trained with meta-learning in few-shot classification tasks. To demonstrate the value of meta-learning, we establish an entropy-limited supervised setting for fair comparisons. Through both theoretical analysis and experimental validation, we establish that meta-learning has a tighter generalization bound compared to whole-class training. We unravel that meta-learning is more efficient with limited entropy and is more robust to label noise and heterogeneous tasks, making it well-suited for unsupervised tasks. Based on these insights, We propose MINO, a meta-learning framework designed to enhance unsupervised performance. MINO utilizes the adaptive clustering algorithm DBSCAN with a dynamic head for unsupervised task construction and a stability-based meta-scaler for robustness against label noise. Extensive experiments confirm its effectiveness in multiple unsupervised few-shot and zero-shot tasks.
中文: 元学习在泛化性能上具有更紧的界,在熵受限场景下效率更高,且对标签噪声和任务异质性更具鲁棒性,因此提出的MINO框架通过自适应聚类和稳定性机制显著提升了无监督任务的性能。
English: Meta-learning demonstrates superior generalization with a tighter bound, greater efficiency in entropy-limited scenarios, and enhanced robustness to label noise and task heterogeneity, leading to the development of MINO, a framework that improves unsupervised performance through adaptive clustering and stability mechanisms.

Authors:Nenad Petrovic, Lukasz Mazur, Alois Knoll
Title: LLM-Based Approach for Enhancing Maintainability of Automotive Architectures
Abstract:
There are many bottlenecks that decrease the flexibility of automotive systems, making their long-term maintenance, as well as updates and extensions in later lifecycle phases increasingly difficult, mainly due to long re-engineering, standardization, and compliance procedures, as well as heterogeneity and numerosity of devices and underlying software components involved. In this paper, we explore the potential of Large Language Models (LLMs) when it comes to the automation of tasks and processes that aim to increase the flexibility of automotive systems. Three case studies towards achieving this goal are considered as outcomes of early-stage research: 1) updates, hardware abstraction, and compliance, 2) interface compatibility checking, and 3) architecture modification suggestions. For proof-of-concept implementation, we rely on OpenAI's GPT-4o model.
中文摘要:本文研究利用大型语言模型(如GPT-4o)自动化提升汽车系统灵活性的任务,通过三个案例研究解决维护和更新中的瓶颈问题。
English Summary: This paper investigates using Large Language Models like GPT-4o to automate tasks that enhance automotive system flexibility, addressing bottlenecks in maintenance and updates through three case studies.

Authors:Botao He, Amir Hossein Shahidzadeh, Yu Chen, Jiayi Wu, Tianrui Guan, Guofei Chen, Howie Choset, Dinesh Manocha, Glen Chou, Cornelia Fermuller, Yiannis Aloimonos
Title: NavMoE: Hybrid Model- and Learning-based Traversability Estimation for Local Navigation via Mixture of Experts
Abstract:
This paper explores traversability estimation for robot navigation. A key bottleneck in traversability estimation lies in efficiently achieving reliable and robust predictions while accurately encoding both geometric and semantic information across diverse environments. We introduce Navigation via Mixture of Experts (NAVMOE), a hierarchical and modular approach for traversability estimation and local navigation. NAVMOE combines multiple specialized models for specific terrain types, each of which can be either a classical model-based or a learning-based approach that predicts traversability for specific terrain types. NAVMOE dynamically weights the contributions of different models based on the input environment through a gating network. Overall, our approach offers three advantages: First, NAVMOE enables traversability estimation to adaptively leverage specialized approaches for different terrains, which enhances generalization across diverse and unseen environments. Second, our approach significantly improves efficiency with negligible cost of solution quality by introducing a training-free lazy gating mechanism, which is designed to minimize the number of activated experts during inference. Third, our approach uses a two-stage training strategy that enables the training for the gating networks within the hybrid MoE method that contains nondifferentiable modules. Extensive experiments show that NAVMOE delivers a better efficiency and performance balance than any individual expert or full ensemble across different domains, improving cross-domain generalization and reducing average computational cost by 81.2% via lazy gating, with less than a 2% loss in path quality.
中文: 本文提出NAVMOE分层模块化系统,通过动态加权整合专业地形模型,在跨环境泛化能力显著提升的同时,利用惰性门控机制将计算成本降低81.2%,且路径质量损失不足2%。
English: This paper introduces NAVMOE, a hierarchical modular system for robot navigation that combines specialized terrain models through dynamic weighting to enhance cross-environment generalization while reducing computational costs by 81.2% with minimal path quality loss.

Authors:Huajun Zhou, Fengtao Zhou, Jiabo Ma, Yingxue Xu, Xi Wang, Xiuming Zhang, Li Liang, Zhenhui Li, Hao Chen
Title: A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction
Abstract:
Multimodal data provides heterogeneous information for a holistic understanding of the tumor microenvironment. However, existing AI models often struggle to harness the rich information within multimodal data and extract poorly generalizable representations. Here we present MICE (Multimodal data Integration via Collaborative Experts), a multimodal foundation model that effectively integrates pathology images, clinical reports, and genomics data for precise pan-cancer prognosis prediction. Instead of conventional multi-expert modules, MICE employs multiple functionally diverse experts to comprehensively capture both cross-cancer and cancer-specific insights. Leveraging data from 11,799 patients across 30 cancer types, we enhanced MICE's generalizability by coupling contrastive and supervised learning. MICE outperformed both unimodal and state-of-the-art multi-expert-based multimodal models, demonstrating substantial improvements in C-index ranging from 3.8% to 11.2% on internal cohorts and 5.8% to 8.8% on independent cohorts, respectively. Moreover, it exhibited remarkable data efficiency across diverse clinical scenarios. With its enhanced generalizability and data efficiency, MICE establishes an effective and scalable foundation for pan-cancer prognosis prediction, holding strong potential to personalize tailored therapies and improve treatment outcomes.
中文:MICE是一种多模态基础模型,通过功能多样的专家模块和对比学习整合病理图像、临床报告和基因组数据,在泛癌预后预测中展现出卓越的泛化能力和数据效率。
English: MICE is a multimodal foundation model that integrates pathology images, clinical reports, and genomics data using functionally diverse experts and contrastive learning, achieving superior pan-cancer prognosis prediction with enhanced generalizability and data efficiency.

Authors:Zhongrui Gui, Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman
Title: Character-Centric Understanding of Animated Movies
Abstract:
Animated movies are captivating for their unique character designs and imaginative storytelling, yet they pose significant challenges for existing recognition systems. Unlike the consistent visual patterns detected by conventional face recognition methods, animated characters exhibit extreme diversity in their appearance, motion, and deformation. In this work, we propose an audio-visual pipeline to enable automatic and robust animated character recognition, and thereby enhance character-centric understanding of animated movies. Central to our approach is the automatic construction of an audio-visual character bank from online sources. This bank contains both visual exemplars and voice (audio) samples for each character, enabling subsequent multi-modal character recognition despite long-tailed appearance distributions. Building on accurate character recognition, we explore two downstream applications: Audio Description (AD) generation for visually impaired audiences, and character-aware subtitling for the hearing impaired. To support research in this domain, we introduce CMD-AM, a new dataset of 75 animated movies with comprehensive annotations. Our character-centric pipeline demonstrates significant improvements in both accessibility and narrative comprehension for animated content over prior face-detection-based approaches. For the code and dataset, visit https://www.robots.ox.ac.uk/~vgg/research/animated_ad/.
中文: 本研究提出一种音视频结合的流程,通过自动构建角色库实现鲁棒的动画角色识别,并借助带标注的新数据集生成音频描述和角色感知字幕,显著提升了动画内容的无障碍访问体验。
English: This research introduces an audio-visual pipeline for robust animated character recognition, leveraging an automatically constructed character bank to enhance accessibility through audio descriptions and character-aware subtitles, supported by a new annotated dataset.

Authors:Weihao Zhu, Long Shi, Kang Wei, Zhen Mei, Zhe Wang, Jiaheng Wang, Jun Li
Title: When MoE Meets Blockchain: A Trustworthy Distributed Framework of Large Models
Abstract:
As an enabling architecture of Large Models (LMs), Mixture of Experts (MoE) has become prevalent thanks to its sparsely-gated mechanism, which lowers computational overhead while maintaining learning performance comparable to dense LMs. The essence of MoE lies in utilizing a group of neural networks (called experts) with each specializing in different types of tasks, along with a trainable gating network that selectively activates a subset of these experts to handle specific tasks. Traditional cloud-based MoE encounters challenges such as prolonged response latency, high bandwidth consumption, and data privacy leakage. To address these issues, researchers have proposed to deploy MoE over distributed edge networks. However, a key concern of distributed MoE frameworks is the lack of trust in data interactions among distributed experts without the surveillance of any trusted authority, and thereby prone to potential attacks such as data manipulation. In response to the security issues of traditional distributed MoE, we propose a blockchain-aided trustworthy MoE (B-MoE) framework that consists of three layers: the edge layer, the blockchain layer, and the storage layer. In this framework, the edge layer employs the activated experts downloaded from the storage layer to process the learning tasks, while the blockchain layer functions as a decentralized trustworthy network to trace, verify, and record the computational results of the experts from the edge layer. The experimental results demonstrate that B-MoE is more robust to data manipulation attacks than traditional distributed MoE during both the training and inference processes.
中文: 本文提出的区块链辅助可信MoE(B-MoE)框架通过区块链技术验证和记录计算结果,解决了传统分布式专家混合模型的安全漏洞,实验证明该框架在训练和推理过程中对数据篡改攻击具有更强的鲁棒性。
English: The proposed blockchain-aided trustworthy MoE (B-MoE) framework addresses security vulnerabilities in traditional distributed Mixture of Experts systems by leveraging blockchain technology to verify and record computational results, demonstrating enhanced robustness against data manipulation attacks during both training and inference phases.

Authors:Jian Wang, Xiaofei Xie, Qiang Hu, Shangqing Liu, Yi Li
Title: Do Code Semantics Help? A Comprehensive Study on Execution Trace-Based Information for Code Large Language Models
Abstract:
Code Large Language Models (Code LLMs) have opened a new era in programming with their impressive capabilities. However, recent research has revealed critical limitations in their ability to reason about runtime behavior and understand the actual functionality of programs, which poses significant challenges for their post-training and practical deployment. Specifically, Code LLMs encounter two principal issues: (1) a lack of proficiency in reasoning about program execution behavior, as they struggle to interpret what programs actually do during runtime, and (2) the inconsistent and fragmented representation of semantic information, such as execution traces, across existing methods, which hinders their ability to generalize and reason effectively. These challenges underscore the necessity for more systematic approaches to enhance the reasoning capabilities of Code LLMs. To address these issues, we introduce a generic framework to support integrating semantic information~(e.g., execution trace) to code task-relevant prompts, and conduct a comprehensive study to explore the role of semantic information in enhancing the reasoning ability of Code LLMs accordingly. Specifically, we focus on investigating the usefulness of trace-based semantic information in boosting supervised fine-tuning~(SFT) and post-phase inference of Code LLMs. The experimental results surprisingly disagree with previous works and demonstrate that semantic information has limited usefulness for SFT and test time scaling of Code LLM.
中文: 代码大语言模型在程序运行时行为推理和语义信息整合方面存在显著缺陷,但引入执行轨迹等语义信息的新框架研究表明,这些增强对模型微调和推理的提升效果有限,与先前结论相悖。
English: Code LLMs face significant limitations in reasoning about program runtime behavior and semantic information representation, but a new framework integrating execution traces shows that such semantic enhancements offer limited benefits for fine-tuning and inference, contradicting prior research.

Authors:Wending Liu, Siyun Liang, Huy H. Nguyen, Isao Echizen
Title: A Controllable 3D Deepfake Generation Framework with Gaussian Splatting
Abstract:
We propose a novel 3D deepfake generation framework based on 3D Gaussian Splatting that enables realistic, identity-preserving face swapping and reenactment in a fully controllable 3D space. Compared to conventional 2D deepfake approaches that suffer from geometric inconsistencies and limited generalization to novel view, our method combines a parametric head model with dynamic Gaussian representations to support multi-view consistent rendering, precise expression control, and seamless background integration. To address editing challenges in point-based representations, we explicitly separate the head and background Gaussians and use pre-trained 2D guidance to optimize the facial region across views. We further introduce a repair module to enhance visual consistency under extreme poses and expressions. Experiments on NeRSemble and additional evaluation videos demonstrate that our method achieves comparable performance to state-of-the-art 2D approaches in identity preservation, as well as pose and expression consistency, while significantly outperforming them in multi-view rendering quality and 3D consistency. Our approach bridges the gap between 3D modeling and deepfake synthesis, enabling new directions for scene-aware, controllable, and immersive visual forgeries, revealing the threat that emerging 3D Gaussian Splatting technique could be used for manipulation attacks.
中文: 本文提出了一种基于3D高斯泼溅的新型深度伪造框架,能够在三维空间中实现身份保持的面部替换与重演,在保持多视角一致性的同时超越传统二维方法,并揭示了该技术可能带来的操控风险。
English: This paper introduces a 3D Gaussian Splatting-based framework for generating realistic deepfakes that preserve identity and enable full control over facial swapping and reenactment in 3D space, outperforming 2D methods in multi-view consistency while highlighting potential manipulation risks.

Authors:Yufei Tang, Daiheng Gao, Pingyu Wu, Wenbo Zhou, Bang Zhang, Weiming Zhang
Title: Beyond Sliders: Mastering the Art of Diffusion-based Image Manipulation
Abstract:
In the realm of image generation, the quest for realism and customization has never been more pressing. While existing methods like concept sliders have made strides, they often falter when it comes to no-AIGC images, particularly images captured in real world settings. To bridge this gap, we introduce Beyond Sliders, an innovative framework that integrates GANs and diffusion models to facilitate sophisticated image manipulation across diverse image categories. Improved upon concept sliders, our method refines the image through fine grained guidance both textual and visual in an adversarial manner, leading to a marked enhancement in image quality and realism. Extensive experimental validation confirms the robustness and versatility of Beyond Sliders across a spectrum of applications.
中文: Beyond Sliders提出了一种创新框架,融合GAN与扩散模型,通过细粒度对抗性指导提升图像真实感和定制化水平,在多样化应用中显著优化了图像质量。
English: Beyond Sliders introduces an innovative framework combining GANs and diffusion models to enhance image realism and customization through fine-grained adversarial guidance, significantly improving quality across diverse applications.

Authors:Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Prasenjit Dey, Ravi Kokku, Pawan Goyal, Niloy Ganguly
Title: Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents
Abstract:
In this paper, we introduce Spotlight, a novel paradigm for information extraction that produces concise, engaging narratives by highlighting the most compelling aspects of a document. Unlike traditional summaries, which prioritize comprehensive coverage, spotlights selectively emphasize intriguing content to foster deeper reader engagement with the source material. We formally differentiate spotlights from related constructs and support our analysis with a detailed benchmarking study using new datasets curated for this work. To generate high-quality spotlights, we propose a two-stage approach: fine-tuning a large language model on our benchmark data, followed by alignment via Direct Preference Optimization (DPO). Our comprehensive evaluation demonstrates that the resulting model not only identifies key elements with precision but also enhances readability and boosts the engagement value of the original document.
中文: 本文提出Spotlight信息提取方法,通过突出文档中有趣内容来生成引人入胜的叙述,采用经过DPO优化的两阶段模型,有效提升可读性和读者参与度。
English: This paper presents Spotlight, an information extraction method that creates engaging narratives by emphasizing intriguing document content, using a two-stage model fine-tuned with DPO to enhance readability and engagement.

Authors:Pengcheng Jiang, Siru Ouyang, Yizhu Jiao, Ming Zhong, Runchu Tian, Jiawei Han
Title: A Survey on Retrieval And Structuring Augmented Generation with Large Language Models
Abstract:
Large Language Models (LLMs) have revolutionized natural language processing with their remarkable capabilities in text generation and reasoning. However, these models face critical challenges when deployed in real-world applications, including hallucination generation, outdated knowledge, and limited domain expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these limitations by integrating dynamic information retrieval with structured knowledge representations. This survey (1) examines retrieval mechanisms including sparse, dense, and hybrid approaches for accessing external knowledge; (2) explore text structuring techniques such as taxonomy construction, hierarchical classification, and information extraction that transform unstructured text into organized representations; and (3) investigate how these structured representations integrate with LLMs through prompt-based methods, reasoning frameworks, and knowledge embedding techniques. It also identifies technical challenges in retrieval efficiency, structure quality, and knowledge integration, while highlighting research opportunities in multimodal retrieval, cross-lingual structures, and interactive systems. This comprehensive overview provides researchers and practitioners with insights into RAS methods, applications, and future directions.
中文摘要:检索与结构化增强生成通过动态检索外部知识并构建结构化表示,有效解决大语言模型在现实应用中存在的幻觉生成、知识陈旧等关键问题。
English Summary: Retrieval And Structuring Augmented Generation enhances Large Language Models by dynamically retrieving external knowledge and organizing it into structured representations to overcome limitations like hallucinations and outdated information.

Authors:Linhao Li, Yiwen Ye, Ziyang Chen, Yong Xia
Title: Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation
Abstract:
3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter number by 83-87% across all datasets. These findings underscore PSP-Seg's potential as a cost-effective yet high-performing alternative for widespread clinical application.
中文:PSP-Seg是一种渐进式剪枝框架,通过动态优化三维医学图像分割,在保持与nnU-Net相当性能的同时,大幅降低了GPU内存占用、训练时间和参数数量。
English: PSP-Seg is a progressive pruning framework that dynamically optimizes 3D medical image segmentation, achieving performance comparable to nnU-Net while significantly reducing resource usage in GPU memory, training time, and parameters.

Authors:Wuyuao Mai, Geng Hong, Qi Liu, Jinsong Chen, Jiarun Dai, Xudong Pan, Yuan Zhang, Min Yang
Title: Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing
Abstract:
Penetration testing is critical for identifying and mitigating security vulnerabilities, yet traditional approaches remain expensive, time-consuming, and dependent on expert human labor. Recent work has explored AI-driven pentesting agents, but their evaluation relies on oversimplified capture-the-flag (CTF) settings that embed prior knowledge and reduce complexity, leading to performance estimates far from real-world practice. We close this gap by introducing the first real-world, agent-oriented pentesting benchmark, TermiBench, which shifts the goal from 'flag finding' to achieving full system control. The benchmark spans 510 hosts across 25 services and 30 CVEs, with realistic environments that require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution. Using this benchmark, we find that existing systems can hardly obtain system shells under realistic conditions. To address these challenges, we propose TermiAgent, a multi-agent penetration testing framework. TermiAgent mitigates long-context forgetting with a Located Memory Activation mechanism and builds a reliable exploit arsenal via structured code understanding rather than naive retrieval. In evaluations, our work outperforms state-of-the-art agents, exhibiting stronger penetration testing capability, reducing execution time and financial cost, and demonstrating practicality even on laptop-scale deployments. Our work delivers both the first open-source benchmark for real-world autonomous pentesting and a novel agent framework that establishes a milestone for AI-driven penetration testing.
中文摘要:本文提出了首个面向真实场景的渗透测试基准TermiBench,将评估目标从简单的夺旗转为完整系统控制,并开发了多智能体框架TermiAgent,通过定位记忆激活和结构化代码理解显著提升渗透能力,在性能和成本方面均优于现有方案。
English Summary: This paper introduces TermiBench, the first real-world benchmark for AI-driven penetration testing that shifts focus from simplified flag-finding to full system control, and proposes TermiAgent, a multi-agent framework that outperforms existing systems by mitigating long-context issues and enabling robust exploit execution.

Authors:Haokai Su, Haoxiang Luo, Shunpeng Yang, Kaiwen Jiang, Wei Zhang, Hua Chen
Title: LIPM-Guided Reinforcement Learning for Stable and Perceptive Locomotion in Bipedal Robots
Abstract:
Achieving stable and robust perceptive locomotion for bipedal robots in unstructured outdoor environments remains a critical challenge due to complex terrain geometry and susceptibility to external disturbances. In this work, we propose a novel reward design inspired by the Linear Inverted Pendulum Model (LIPM) to enable perceptive and stable locomotion in the wild. The LIPM provides theoretical guidance for dynamic balance by regulating the center of mass (CoM) height and the torso orientation. These are key factors for terrain-aware locomotion, as they help ensure a stable viewpoint for the robot's camera. Building on this insight, we design a reward function that promotes balance and dynamic stability while encouraging accurate CoM trajectory tracking. To adaptively trade off between velocity tracking and stability, we leverage the Reward Fusion Module (RFM) approach that prioritizes stability when needed. A double-critic architecture is adopted to separately evaluate stability and locomotion objectives, improving training efficiency and robustness. We validate our approach through extensive experiments on a bipedal robot in both simulation and real-world outdoor environments. The results demonstrate superior terrain adaptability, disturbance rejection, and consistent performance across a wide range of speeds and perceptual conditions.
中文: 本研究提出一种基于线性倒立摆模型的新型奖励机制,通过自适应平衡控制和双目标优化,使双足机器人能够在非结构化户外环境中实现稳定、感知敏锐的运动能力。
English: This study introduces a novel Linear Inverted Pendulum Model-inspired reward design that enables bipedal robots to achieve stable, perceptive locomotion in unstructured outdoor environments through adaptive balance control and dual-objective optimization.

Authors:Willy Sucipto, Jianlong Zhou, Ray Seung Min Kwon, Fang Chen
Title: A Survey of TinyML Applications in Beekeeping for Hive Monitoring and Management
Abstract:
Honey bee colonies are essential for global food security and ecosystem stability, yet they face escalating threats from pests, diseases, and environmental stressors. Traditional hive inspections are labor-intensive and disruptive, while cloud-based monitoring solutions remain impractical for remote or resource-limited apiaries. Recent advances in Internet of Things (IoT) and Tiny Machine Learning (TinyML) enable low-power, real-time monitoring directly on edge devices, offering scalable and non-invasive alternatives. This survey synthesizes current innovations at the intersection of TinyML and apiculture, organized around four key functional areas: monitoring hive conditions, recognizing bee behaviors, detecting pests and diseases, and forecasting swarming events. We further examine supporting resources, including publicly available datasets, lightweight model architectures optimized for embedded deployment, and benchmarking strategies tailored to field constraints. Critical limitations such as data scarcity, generalization challenges, and deployment barriers in off-grid environments are highlighted, alongside emerging opportunities in ultra-efficient inference pipelines, adaptive edge learning, and dataset standardization. By consolidating research and engineering practices, this work provides a foundation for scalable, AI-driven, and ecologically informed monitoring systems to support sustainable pollinator management.
中文: 本综述整合了TinyML与物联网在养蜂业中的前沿应用,通过边缘设备实现蜂巢状况监测、病虫害识别及蜂群预警,并剖析了数据匮乏、部署障碍等关键挑战与标准化机遇。
English: This survey consolidates advances in TinyML and IoT for developing low-power, edge-based monitoring systems in apiculture, addressing hive condition tracking, pest detection, and swarm forecasting while highlighting challenges like data scarcity and deployment barriers.

Authors:Mesay Gemeda Yigezu, Girma Yohannis Bade, Atnafu Lambebo Tonja, Olga Kolesnikova, Grigori Sidorov, Alexander Gelbukh
Title: Bilingual Word Level Language Identification for Omotic Languages
Abstract:
Language identification is the task of determining the languages for a given text. In many real world scenarios, text may contain more than one language, particularly in multilingual communities. Bilingual Language Identification (BLID) is the task of identifying and distinguishing between two languages in a given text. This paper presents BLID for languages spoken in the southern part of Ethiopia, namely Wolaita and Gofa. The presence of words similarities and differences between the two languages makes the language identification task challenging. To overcome this challenge, we employed various experiments on various approaches. Then, the combination of the BERT based pretrained language model and LSTM approach performed better, with an F1 score of 0.72 on the test set. As a result, the work will be effective in tackling unwanted social media issues and providing a foundation for further research in this area.
中文: 本文针对埃塞俄比亚的沃莱塔语和戈法语进行双语语言识别,通过结合BERT和LSTM模型有效应对了两种语言词汇相似性的挑战,取得了0.72的最高F1分数。
English: This paper tackles Bilingual Language Identification (BLID) for Wolaita and Gofa in Ethiopia, achieving a top F1 score of 0.72 using a combined BERT and LSTM model to address the challenge of similar words between the languages.

Authors:ULB CompGeom Group, Zachary Abel, Hugo Akitaya, Lily Chung, Erik D. Demaine, Jenny Diomidova, Della Hendrickson, Stefan Langerman, Jayson Lynch
Title: Undecidability of Tiling with a Tromino
Abstract:
Given a periodic placement of copies of a tromino (either L or I), we prove co-RE-completeness (and hence undecidability) of deciding whether it can be completed to a plane tiling. By contrast, the problem becomes decidable if the initial placement is finite, or if the tile is a domino instead of a tromino (in any dimension). As a consequence, tiling a given periodic subset of the plane with a given tromino (L or I) is co-RE-complete. We also prove co-RE-completeness of tiling the entire plane with two polyominoes (one of which is disconnected and the other of which has constant size), and of tiling 3D space with two connected polycubes (one of which has constant size). If we restrict to tiling by translation only (no rotation), then we obtain co-RE-completeness with one more tile: two trominoes for a periodic subset of 2D, three polyominoes for the 2D plane, and three connected polycubes for 3D space. Along the way, we prove several new complexity and algorithmic results about periodic (infinite) graphs. Notably, we prove that Periodic Planar (1-in-)3SAT-3, 3DM, and Graph Orientation are co-RE-complete in 2D and PSPACE-complete in 1D; we extend basic results in graph drawing to 2D periodic graphs; and we give a polynomial-time algorithm for perfect matching in bipartite periodic graphs.
中文: 对于周期性放置的三连方块,判断其能否完成平面铺砌是co-RE完全且不可判定的,而有限放置或使用骨牌时问题可判定,相关复杂性结果还推广到多种铺砌情形和周期图的研究中。
English: For periodic placements of trominoes, determining if they can complete a plane tiling is co-RE-complete and undecidable, whereas it becomes decidable for finite placements or dominoes, with related complexity results extended to various tiling scenarios and periodic graphs.

Authors:Taihui Wang, Rilin Chen, Tong Lei, Andong Li, Jinzheng Zhao, Meng Yu, Dong Yu
Title: Target matching based generative model for speech enhancement
Abstract:
The design of mean and variance schedules for the perturbed signal is a fundamental challenge in generative models. While score-based and Schrödinger bridge-based models require careful selection of the stochastic differential equation to derive the corresponding schedules, flow-based models address this issue via vector field matching. However, this strategy often leads to hallucination artifacts and inefficient training and inference processes due to the potential inclusion of stochastic components in the vector field. Additionally, the widely adopted diffusion backbone, NCSN++, suffers from high computational complexity. To overcome these limitations, we propose a novel target-based generative framework that enhances both the flexibility of mean/variance schedule design and the efficiency of training and inference processes. Specifically, we eliminate the stochastic components in the training loss by reformulating the generative speech enhancement task as a target signal estimation problem, which therefore leads to more stable and efficient training and inference processes. In addition, we employ a logistic mean schedule and a bridge variance schedule, which yield a more favorable signal-to-noise ratio trajectory compared to several widely used schedules and thus leads to a more efficient perturbation strategy. Furthermore, we propose a new diffusion backbone for audio, which significantly improves the efficiency over NCSN++ by explicitly modeling long-term frame correlations and cross-band dependencies.
中文: 该生成框架通过消除随机成分、采用优化的均值与方差调度策略以及设计新型音频扩散架构,显著提升了调度设计的灵活性,并实现了更高效稳定的训练与推理过程。
English: The proposed generative framework enhances flexibility in mean and variance schedule design while improving training and inference efficiency by eliminating stochastic components and employing optimized schedules with a new audio-specific diffusion backbone.

Authors:Vardhan Palod, Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati
Title: Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
Abstract:
Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. While these reasoning traces or Chain of Thoughts (CoTs) are correlated with performance gains, the mechanisms underlying them remain unclear. A prevailing assumption in the community has been to anthropomorphize these tokens as "thinking", treating longer traces as evidence of higher problem-adaptive computation. In this work, we critically examine whether intermediate token sequence length reflects or correlates with problem difficulty. To do so, we train transformer models from scratch on derivational traces of the A* search algorithm, where the number of operations required to solve a maze problem provides a precise and verifiable measure of problem complexity. We first evaluate the models on trivial free-space problems, finding that even for the simplest tasks, they often produce excessively long reasoning traces and sometimes fail to generate a solution. We then systematically evaluate the model on out-of-distribution problems and find that the intermediate token length and ground truth A* trace length only loosely correlate. We notice that the few cases where correlation appears are those where the problems are closer to the training distribution, suggesting that the effect arises from approximate recall rather than genuine problem-adaptive computation. This suggests that the inherent computational complexity of the problem instance is not a significant factor, but rather its distributional distance from the training data. These results challenge the assumption that intermediate trace generation is adaptive to problem difficulty and caution against interpreting longer sequences in systems like R1 as automatically indicative of "thinking effort".
中文: 中间标记生成并不能可靠地反映问题难度,因为其长度与训练数据接近度的关联大于真正的自适应计算,这对"较长推理轨迹代表更深思考"的假设提出了质疑。
English: Intermediate token generation does not reliably reflect problem difficulty, as its length correlates more with training data proximity than genuine adaptive computation, challenging the assumption that longer traces indicate deeper thinking.

Authors:Théo Matricon, Mathieu Acher, Helge Spieker, Arnaud Gotlieb
Title: Efficiently Ranking Software Variants with Minimal Benchmarks
Abstract:
Benchmarking is a common practice in software engineering to assess the qualities and performance of software variants, coming from multiple competing systems or from configurations of the same system. Benchmarks are used notably to compare and understand variant performance, fine-tune software, detect regressions, or design new software systems. The execution of benchmarks to get a complete picture of software variants is highly costly in terms of computational resources and time. In this paper, we propose a novel approach for reducing benchmarks while maintaining stable rankings, using test suite optimization techniques. That is, we remove instances from the benchmarks while trying to keep the same rankings of the variants on all tests. Our method, BISection Sampling, BISS, strategically retains the most critical tests and applies a novel divide-and-conquer approach to efficiently sample among relevant remaining tests. We experiment with datasets and use cases from LLM leaderboards, SAT competitions, and configurable systems for performance modeling. Our results show that our method outperforms baselines even when operating on a subset of variants. Using BISS, we reduce the computational cost of the benchmarks on average to 44% and on more than half the benchmarks by up to 99% without loss in ranking stability.
中文: 本文提出BISection Sampling(BISS)新方法,通过策略性保留关键测试来维持软件变体排名稳定性,将基准测试计算成本平均降至44%,半数以上场景可降低高达99%。
English: This paper introduces BISection Sampling (BISS), a novel method that reduces the computational cost of software benchmarking by strategically selecting critical tests to maintain stable variant rankings, achieving up to 99% cost reduction in over half of benchmarks.

Authors:Mengcheng Lan, Chaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, Song Bai
Title: Text4Seg++: Advancing Image Segmentation via Generative Language Modeling
Abstract:
Multimodal Large Language Models (MLLMs) have shown exceptional capabilities in vision-language tasks. However, effectively integrating image segmentation into these models remains a significant challenge. In this work, we propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem, eliminating the need for additional decoders and significantly simplifying the segmentation process. Our key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We first introduce image-wise semantic descriptors, a patch-aligned textual representation of segmentation masks that integrates naturally into the language modeling pipeline. To enhance efficiency, we introduce the Row-wise Run-Length Encoding (R-RLE), which compresses redundant text sequences, reducing the length of semantic descriptors by 74% and accelerating inference by $3\times$, without compromising performance. Building upon this, our initial framework Text4Seg achieves strong segmentation performance across a wide range of vision tasks. To further improve granularity and compactness, we propose box-wise semantic descriptors, which localizes regions of interest using bounding boxes and represents region masks via structured mask tokens called semantic bricks. This leads to our refined model, Text4Seg++, which formulates segmentation as a next-brick prediction task, combining precision, scalability, and generative efficiency. Comprehensive experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models across diverse benchmarks without any task-specific fine-tuning, while remaining compatible with existing MLLM backbones. Our work highlights the effectiveness, scalability, and generalizability of text-driven image segmentation within the MLLM framework.
中文: 本文提出了一种基于文本驱动的图像分割方法,通过语义描述符将分割任务转化为文本生成问题,无需特定任务微调即可在多模态大语言模型中实现领先性能。
English: This paper introduces a text-driven approach for image segmentation in Multimodal Large Language Models, using semantic descriptors to convert segmentation into text generation and achieving state-of-the-art performance without task-specific fine-tuning.

Authors:Yiwen Ye, Yicheng Wu, Xiangde Luo, He Zhang, Ziyang Chen, Ting Dang, Yanning Zhang, Yong Xia
Title: MedSeqFT: Sequential Fine-tuning Foundation Models for 3D Medical Image Segmentation
Abstract:
Foundation models have become a promising paradigm for advancing medical image analysis, particularly for segmentation tasks where downstream applications often emerge sequentially. Existing fine-tuning strategies, however, remain limited: parallel fine-tuning isolates tasks and fails to exploit shared knowledge, while multi-task fine-tuning requires simultaneous access to all datasets and struggles with incremental task integration. To address these challenges, we propose MedSeqFT, a sequential fine-tuning framework that progressively adapts pre-trained models to new tasks while refining their representational capacity. MedSeqFT introduces two core components: (1) Maximum Data Similarity (MDS) selection, which identifies downstream samples most representative of the original pre-training distribution to preserve general knowledge, and (2) Knowledge and Generalization Retention Fine-Tuning (K&G RFT), a LoRA-based knowledge distillation scheme that balances task-specific adaptation with the retention of pre-trained knowledge. Extensive experiments on two multi-task datasets covering ten 3D segmentation tasks demonstrate that MedSeqFT consistently outperforms state-of-the-art fine-tuning strategies, yielding substantial performance gains (e.g., an average Dice improvement of 3.0%). Furthermore, evaluations on two unseen tasks (COVID-19-20 and Kidney) verify that MedSeqFT enhances transferability, particularly for tumor segmentation. Visual analyses of loss landscapes and parameter variations further highlight the robustness of MedSeqFT. These results establish sequential fine-tuning as an effective, knowledge-retentive paradigm for adapting foundation models to evolving clinical tasks. Code will be released.
中文: 提出的MedSeqFT框架通过数据选择和知识蒸馏保留预训练知识,实现了基础模型在医学图像分割任务中的连续微调,相比现有方法获得了更优的性能和更强的迁移能力。
English: The proposed MedSeqFT framework enables sequential fine-tuning of foundation models for medical image segmentation by preserving pre-trained knowledge through data selection and knowledge distillation, achieving superior performance and enhanced transferability compared to existing methods.

Authors:Jerin Yasmin, Wenxin Jiang, James C. Davis, Yuan Tian
Title: Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source Projects
Abstract:
Pre-trained models (PTMs) are machine learning models that have been trained in advance, often on large-scale data, and can be reused for new tasks, thereby reducing the need for costly training from scratch. Their widespread adoption introduces a new class of software dependency, which we term Software Dependencies 2.0, extending beyond conventional libraries to learned behaviors embodied in trained models and their associated artifacts. The integration of PTMs as software dependencies in real projects remains unclear, potentially threatening maintainability and reliability of modern software systems that increasingly rely on them. Objective: In this study, we investigate Software Dependencies 2.0 in open-source software (OSS) projects by examining the reuse of PTMs, with a focus on how developers manage and integrate these models. Specifically, we seek to understand: (1) how OSS projects structure and document their PTM dependencies; (2) what stages and organizational patterns emerge in the reuse pipelines of PTMs within these projects; and (3) the interactions among PTMs and other learned components across pipeline stages. We conduct a mixed-methods analysis of a statistically significant random sample of 401 GitHub repositories from the PeaTMOSS dataset (28,575 repositories reusing PTMs from Hugging Face and PyTorch Hub). We quantitatively examine PTM reuse by identifying patterns and qualitatively investigate how developers integrate and manage these models in practice.
中文: 本研究通过混合方法分析401个GitHub仓库,考察开源软件项目如何将预训练模型作为"软件依赖2.0"进行管理,重点关注其集成模式、文档实践和流水线结构。
English: This study investigates how open-source software projects manage pre-trained models as "Software Dependencies 2.0" through mixed-methods analysis of 401 GitHub repositories, examining their integration patterns, documentation practices, and pipeline structures.

Authors:Qingyuan Li, Binchang Li, Cuiyun Gao, Shuzheng Gao, Zongjie Li
Title: Empirical Study of Code Large Language Models for Binary Security Patch Detection
Abstract:
Security patch detection (SPD) is crucial for maintaining software security, as unpatched vulnerabilities can lead to severe security risks. In recent years, numerous learning-based SPD approaches have demonstrated promising results on source code. However, these approaches typically cannot be applied to closed-source applications and proprietary systems that constitute a significant portion of real-world software, as they release patches only with binary files, and the source code is inaccessible. Given the impressive performance of code large language models (LLMs) in code intelligence and binary analysis tasks such as decompilation and compilation optimization, their potential for detecting binary security patches remains unexplored, exposing a significant research gap between their demonstrated low-level code understanding capabilities and this critical security task. To address this gap, we construct a large-scale binary patch dataset containing \textbf{19,448} samples, with two levels of representation: assembly code and pseudo-code, and systematically evaluate \textbf{19} code LLMs of varying scales to investigate their capability in binary SPD tasks. Our initial exploration demonstrates that directly prompting vanilla code LLMs struggles to accurately identify security patches from binary patches, and even state-of-the-art prompting techniques fail to mitigate the lack of domain knowledge in binary SPD within vanilla models. Drawing on the initial findings, we further investigate the fine-tuning strategy for injecting binary SPD domain knowledge into code LLMs through two levels of representation. Experimental results demonstrate that fine-tuned LLMs achieve outstanding performance, with the best results obtained on the pseudo-code representation.
中文: 本研究通过构建大规模二进制补丁数据集,证明了经过微调的代码大语言模型(尤其是使用伪代码表示时)在二进制安全补丁检测中表现卓越,有效弥补了闭源软件安全补丁检测的研究空白。
English: This study addresses the gap in detecting security patches for closed-source software by constructing a large binary patch dataset and demonstrating that fine-tuned code LLMs, particularly using pseudo-code representations, achieve superior performance in binary security patch detection compared to direct prompting methods.

Authors:Mahsa Paknejad, Parisa Fard Moshiri, Murat Simsek, Burak Kantarci, Hussein T. Mouftah
Title: On-Dyn-CDA: A Real-Time Cost-Driven Task Offloading Algorithm for Vehicular Networks with Reduced Latency and Task Loss
Abstract:
Real-time task processing is a critical challenge in vehicular networks, where achieving low latency and minimizing dropped task ratio depend on efficient task execution. Our primary objective is to maximize the number of completed tasks while minimizing overall latency, with a particular focus on reducing number of dropped tasks. To this end, we investigate both static and dynamic versions of an optimization algorithm. The static version assumes full task availability, while the dynamic version manages tasks as they arrive. We also distinguish between online and offline cases: the online version incorporates execution time into the offloading decision process, whereas the offline version excludes it, serving as a theoretical benchmark for optimal performance. We evaluate our proposed Online Dynamic Cost-Driven Algorithm (On-Dyn-CDA) against these baselines. Notably, the static Particle Swarm Optimization (PSO) baseline assumes all tasks are transferred to the RSU and processed by the MEC, and its offline version disregards execution time, making it infeasible for real-time applications despite its optimal performance in theory. Our novel On-Dyn-CDA completes execution in just 0.05 seconds under the most complex scenario, compared to 1330.05 seconds required by Dynamic PSO. It also outperforms Dynamic PSO by 3.42% in task loss and achieves a 29.22% reduction in average latency in complex scenarios. Furthermore, it requires neither a dataset nor a training phase, and its low computational complexity ensures efficiency and scalability in dynamic environments.
Chinese: 本研究针对车载网络中的实时任务处理,提出了在线动态成本驱动算法(On-Dyn-CDA),该算法无需训练且计算效率高,在降低任务延迟和丢失率方面显著优于现有方法。
English: This study introduces the Online Dynamic Cost-Driven Algorithm (On-Dyn-CDA) for real-time task processing in vehicular networks, which significantly reduces latency and dropped tasks while requiring no training and minimal computational time compared to existing methods.